Currently, ForgeLLM implements only Continued Pre-Training (CPT), which is the first stage of domain adaptation:
CPT Capabilities:
- Purpose: Adapt base models to domain-specific knowledge (Mnemosyne dataset)
- Training Type: Full fine-tuning or LoRA on raw text data
- Data Format: Plain markdown documents processed into chunks
- Implementation:
forgellm/training/trainer.pywithContinuedPretrainerclass - Status: ✅ Fully Functional - Successfully trains models on custom datasets
- Output: Domain-adapted models that understand Mnemosyne concepts but lack instruction-following capabilities
The second critical stage - Instruction Fine-Tuning - is currently missing:
IFT Requirements:
- Purpose: Transform CPT models into conversational assistants that follow instructions
- Training Type: Fine-tuning on conversation pairs (user prompts + assistant responses)
- Data Format: Structured conversations with proper chat templates
- Implementation:
⚠️ Incomplete -forgellm/training/instruction_tuner.pyexists but has critical issues - Status: ❌ Non-Functional - Broken imports, wrong data format expectations, unused code
# CLI imports non-existent class
from .training.instruction_tuner import InstructionTuner # ❌ Does not exist
# CLI calls non-existent method
tuner.run_tuning() # ❌ Should be run_complete_pipeline()The current ConversationExtractor expects XML format that doesn't exist in our dataset:
<!-- Expected but NOT in our dataset -->
<conversation>
<message from="user">Question</message>
<message from="assistant">Answer</message>
</conversation>Reality: Our dataset contains plain markdown with no conversation structure.
- No integration with existing model management
- No web interface support
- No proper configuration system
- No testing or validation
🎯 ModelArchitectureManager System: We already have a sophisticated architecture detection and formatting system:
- Location:
forgellm/utils/model_architectures.py+model_architectures.json - Capabilities: Auto-detects 10+ model families (Llama, Qwen, Gemma, Mistral, Phi, etc.)
- Chat Formatting: Proper templates for each architecture with special case handling
- Integration: Already used in server and CLI for inference
- Quality: Well-tested and handles edge cases (e.g., Gemma system message workarounds)
Key Methods Available:
detect_architecture(model_name)- Auto-detect model familyformat_messages(messages, model_name)- Apply proper chat templatesis_instruct_model(model_name)- Determine if model expects instruction formatget_architecture_info(arch)- Get detailed architecture information
The external script instruct_tuning.py provides a complete, research-based IFT implementation that we should adapt:
# Data mixture that prevents catastrophic forgetting
data_mixture = {
"oasst1": 31.5%, # General instruction following
"openorca": 27%, # Question answering
"dolly": 18%, # Task completion
"alpaca": 13.5%, # Instruction following
"mnemosyne": 10% # Domain preservation
}# Generic conversation extractor with configurable participant patterns
class FlexibleConversationExtractor:
def __init__(self, participant_patterns: Dict[str, List[str]]):
"""
Args:
participant_patterns: Dict mapping roles to patterns/regex
Example: {
"user": ["**Laurent-Philippe**:", r"\*\*User\*\*:", "Human:"],
"assistant": ["**Mnemosyne**:", "**Assistant**:", "AI:"]
}
"""
self.participant_patterns = participant_patterns
def extract_conversations_from_file(self, file_path):
# Parse using configurable patterns (string matching or regex)
# Convert to proper conversation format
# Preserve domain knowledge
# Support multiple participant identification methodsKey Improvement: Instead of hardcoding **Laurent-Philippe**, use configurable patterns that can be:
- Simple string patterns:
"**Laurent-Philippe**:" - Regex patterns:
r"\*\*Laurent-Philippe.*?:" - Multiple alternatives per role:
["**User**:", "**Human**:", "User:"] - Flexible role mapping:
{"user": [...], "assistant": [...], "system": [...]}
# Use existing ModelArchitectureManager for proper chat formatting
from forgellm.utils.model_architectures import get_model_architecture_manager
def format_conversations_for_training(conversations, base_model_name):
"""Format conversations using existing architecture system"""
arch_manager = get_model_architecture_manager()
formatted_examples = []
for conversation in conversations:
# Convert conversation to messages format
messages = conversation["messages"]
# Use existing architecture detection and formatting
formatted_text = arch_manager.format_messages(messages, base_model_name)
formatted_examples.append({"text": formatted_text})
return formatted_examplesKey Advantage: We already have a comprehensive ModelArchitectureManager that:
- Auto-detects architectures: Llama, Qwen, Gemma, Mistral, Phi, etc.
- Handles chat formatting: Proper templates for each model family
- Supports system messages: Including special handling for Gemma (system as assistant)
- Is already integrated: Used in server and CLI for inference
- Is well-tested: Proven to work with our model loading system
- Supports 10+ architectures: Including edge cases and fallbacks
# Prevent catastrophic forgetting of CPT knowledge
config = {
"learning_rate": 5e-6, # Conservative LR
"batch_size": 4, # Stable training
"max_iterations": 100, # Quick adaptation
"lora_layers": 16, # Parameter efficiency
"mnemosyne_ratio": 0.1 # Domain preservation
}# Fix import issues
class InstructionTuner: # Rename from SOTAInstructTrainer
def run_tuning(self): # Add missing method
return self.run_complete_pipeline()# Replace XML-based extractor with configurable conversation parser
class FlexibleConversationExtractor:
"""Extract conversations using configurable participant patterns"""
def __init__(self, config: Dict[str, Any]):
"""
Args:
config: Configuration with participant patterns
Example: {
"participant_patterns": {
"user": ["**Laurent-Philippe**:", r"\*\*User\d*\*\*:"],
"assistant": ["**Mnemosyne**:", "**Assistant**:"]
},
"conversation_sources": ["verbatims/2025", "conversations"],
"quality_filters": {...}
}
"""
self.config = config
def extract_conversations(self, input_dirs: List[str]):
# Parse using configurable patterns
# Support both string matching and regex
# Convert to instruction format with proper role mapping
# Apply quality filtering
# Return structured conversationsIntegration with ModelArchitectureManager:
# Use existing architecture system for formatting
def format_extracted_conversations(conversations, base_model_name):
arch_manager = get_model_architecture_manager()
return [
{"text": arch_manager.format_messages(conv["messages"], base_model_name)}
for conv in conversations
]# Add HuggingFace dataset integration
def load_general_instruction_data(self):
# Load OASST1, OpenOrca, Dolly, Alpaca
# Apply proper data mixture ratios
# Format for model-specific chat templates- Extend training interface to support IFT after CPT
- Add IFT-specific configuration options
- Show CPT → IFT pipeline progression
- Auto-detect base model architecture (Llama, Gemma, Qwen)
- Apply correct chat formatting
- Validate model compatibility
- Extend dashboard to show IFT metrics
- Track instruction-following capability
- Monitor domain knowledge retention
- Conversation quality filtering
- Response length validation
- Domain alignment scoring
- Instruction-following benchmarks
- Domain knowledge retention tests
- Conversational capability assessment
- Automatic CPT → IFT progression
- Optimal checkpoint selection
- Hyperparameter optimization
# Extend existing config system with flexible conversation extraction
@dataclass
class InstructTuningConfig:
# Inherit from base config
base_model_path: str # CPT checkpoint path
base_model_name: str # Original base model
# IFT-specific parameters
custom_data_ratio: float = 0.1 # Renamed from mnemosyne_ratio
max_train_examples: int = 10000
conversation_sources: List[str] = ["verbatims/2025", "conversations"]
# Flexible participant pattern configuration
participant_patterns: Dict[str, List[str]] = field(default_factory=lambda: {
"user": ["**Laurent-Philippe**:", "**User**:", "Human:"],
"assistant": ["**Mnemosyne**:", "**Assistant**:", "AI:"],
"system": ["**System**:", "System:"]
})
# Pattern matching options
use_regex: bool = True # Enable regex pattern matching
case_sensitive: bool = False # Case-sensitive pattern matching
# Conservative post-CPT settings
learning_rate: float = 5e-6
max_iterations: int = 100
# Architecture integration (leverages existing system)
auto_detect_architecture: bool = True # Use ModelArchitectureManager# Extend ModelManager for IFT models using existing architecture detection
class ModelManager:
def load_ift_model(self, cpt_path: str, ift_adapter: str):
# Load CPT base + IFT adapter
# Auto-detect architecture using existing ModelArchitectureManager
# Apply proper chat formatting based on detected architecture
# Enable instruction mode with appropriate templates
def format_for_instruction(self, messages: List[Dict], model_name: str):
# Use existing ModelArchitectureManager.format_messages()
# Handles all supported architectures automatically
# Includes special cases like Gemma system message handling
arch_manager = get_model_architecture_manager()
return arch_manager.format_messages(messages, model_name)# Complete CPT → IFT pipeline
def run_complete_training_pipeline():
# 1. Run CPT training
cpt_trainer = ContinuedPretrainer(cpt_config)
cpt_checkpoint = cpt_trainer.run_training()
# 2. Run IFT training on best CPT checkpoint
ift_config.base_model_path = cpt_checkpoint
ift_trainer = InstructionTuner(ift_config)
ift_model = ift_trainer.run_tuning()
return ift_model- Complete Training Pipeline: CPT → IFT progression
- Conversational Models: Domain-adapted models that follow instructions
- Web Interface Support: Full IFT training through UI
- Model Publishing: Publish both CPT and IFT models
- Quality Assurance: Proper evaluation and validation
- Instruction Following: Models respond appropriately to user prompts
- Domain Retention: Maintain Mnemosyne knowledge after IFT
- Conversation Quality: Natural, helpful responses
- Pipeline Reliability: Consistent CPT → IFT progression
Current State: ForgeLLM has excellent CPT capabilities but lacks the critical IFT stage needed for conversational AI.
Next Steps: Implement SOTA instruction fine-tuning by adapting the proven external implementation to our codebase architecture.
Priority: ⚡ High - IFT is essential for creating usable conversational models from our CPT outputs.
Timeline: Phase 1 (Fix IFT) should be completed before adding new CPT features, as IFT is the natural next step in the training pipeline.
-
ModelArchitectureManager Integration:
- No need to reimplement chat formatting
- Automatic architecture detection
- Handles 10+ model families with proven templates
- Special case handling (Gemma system messages, etc.)
-
Flexible Conversation Extraction:
- Configurable participant patterns (no hardcoded names)
- Regex support for complex patterns
- Multiple pattern alternatives per role
- Reusable across different datasets and use cases
-
Proven Infrastructure:
- Configuration system already in place
- Model management already handles adapters
- Web interface foundation exists
- Dashboard and monitoring systems ready
Phase 1 Priority: Fix the broken IFT system by:
- Replacing XML extractor with flexible pattern-based system
- Integrating ModelArchitectureManager for chat formatting
- Fixing import/class naming issues in CLI
- Leveraging existing config/model management systems
This approach maximizes code reuse and builds on proven, tested components rather than reimplementing functionality that already exists.
This document reflects the current state of training capabilities and provides a clear roadmap for implementing the missing instruction fine-tuning functionality that will complete our training pipeline.