Development Perspectives: Training Pipeline Evolution

Current State Analysis

✅ What We Have: Continued Pre-Training (CPT) Only

Currently, ForgeLLM implements only Continued Pre-Training (CPT), which is the first stage of domain adaptation:

CPT Capabilities:

  • Purpose: Adapt base models to domain-specific knowledge (Mnemosyne dataset)
  • Training Type: Full fine-tuning or LoRA on raw text data
  • Data Format: Plain markdown documents processed into chunks
  • Implementation: forgellm/training/trainer.py with ContinuedPretrainer class
  • Status: ✅ Fully Functional - Successfully trains models on custom datasets
  • Output: Domain-adapted models that understand Mnemosyne concepts but lack instruction-following capabilities
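
For reference, invoking CPT today looks roughly like the following (a minimal sketch; the class name, module path, and run_training() method match the pipeline code later in this document, while the config construction is simplified):

# Minimal sketch of the existing CPT entry point (config handling simplified)
from forgellm.training.trainer import ContinuedPretrainer

cpt_trainer = ContinuedPretrainer(cpt_config)    # cpt_config: standard training config
cpt_checkpoint = cpt_trainer.run_training()      # produces a domain-adapted checkpoint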

❌ What We're Missing: Instruction Fine-Tuning (IFT/SFT)

The second critical stage - Instruction Fine-Tuning - is currently missing:

IFT Requirements:

  • Purpose: Transform CPT models into conversational assistants that follow instructions
  • Training Type: Fine-tuning on conversation pairs (user prompts + assistant responses)
  • Data Format: Structured conversations with proper chat templates
  • Implementation: ⚠️ Incomplete - forgellm/training/instruction_tuner.py exists but has critical issues
  • Status: ❌ Non-Functional - Broken imports, wrong data format expectations, unused code

Critical Issues with Current IFT Implementation

1. Broken Architecture

# CLI imports non-existent class
from .training.instruction_tuner import InstructionTuner  # ❌ Does not exist

# CLI calls non-existent method
tuner.run_tuning()  # ❌ Should be run_complete_pipeline()

2. Wrong Data Format Expectations

The current ConversationExtractor expects an XML format that does not exist anywhere in our dataset:

<!-- Expected but NOT in our dataset -->
<conversation>
    <message from="user">Question</message>
    <message from="assistant">Answer</message>
</conversation>

Reality: Our dataset contains plain markdown with no XML conversation structure; speakers are marked only by bold name prefixes such as **Laurent-Philippe**:.

3. Incomplete Integration

  • No integration with existing model management
  • No web interface support
  • No proper configuration system
  • No testing or validation

Solution: Implement SOTA Instruction Fine-Tuning

Existing Assets We Can Leverage

🎯 ModelArchitectureManager System: We already have a sophisticated architecture detection and formatting system:

  • Location: forgellm/utils/model_architectures.py + model_architectures.json
  • Capabilities: Auto-detects 10+ model families (Llama, Qwen, Gemma, Mistral, Phi, etc.)
  • Chat Formatting: Proper templates for each architecture with special case handling
  • Integration: Already used in server and CLI for inference
  • Quality: Well-tested and handles edge cases (e.g., Gemma system message workarounds)

Key Methods Available:

  • detect_architecture(model_name) - Auto-detect model family
  • format_messages(messages, model_name) - Apply proper chat templates
  • is_instruct_model(model_name) - Determine if model expects instruction format
  • get_architecture_info(arch) - Get detailed architecture information
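
Putting these together, a minimal usage sketch (the model name is illustrative; the method names come from the list above):

from forgellm.utils.model_architectures import get_model_architecture_manager

arch_manager = get_model_architecture_manager()
messages = [
    {"role": "system", "content": "You are Mnemosyne."},
    {"role": "user", "content": "Explain continued pre-training."},
]

model_name = "Qwen/Qwen2.5-7B-Instruct"                # illustrative model name
arch = arch_manager.detect_architecture(model_name)    # e.g. "qwen"
if arch_manager.is_instruct_model(model_name):
    prompt = arch_manager.format_messages(messages, model_name)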

Reference Implementation Analysis

The external script instruct_tuning.py provides a complete, research-based IFT implementation that we should adapt:

Key Features to Integrate:

1. Hybrid Data Strategy (SOTA 2024-2025)

# Data mixture that prevents catastrophic forgetting
data_mixture = {
    "oasst1": 0.315,      # 31.5% - general instruction following
    "openorca": 0.27,     # 27%   - question answering
    "dolly": 0.18,        # 18%   - task completion
    "alpaca": 0.135,      # 13.5% - instruction following
    "mnemosyne": 0.10,    # 10%   - domain preservation
}
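
These ratios translate directly into per-source example counts. Continuing from the data_mixture above, with the max_train_examples = 10000 default used in the config further below:

# Sanity check: the mixture sums to 1.0 and maps to concrete example counts
assert abs(sum(data_mixture.values()) - 1.0) < 1e-9

counts = {name: int(10_000 * ratio) for name, ratio in data_mixture.items()}
# -> {'oasst1': 3150, 'openorca': 2700, 'dolly': 1800, 'alpaca': 1350, 'mnemosyne': 1000}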

2. Flexible Conversation Extraction

# Generic conversation extractor with configurable participant patterns
from typing import Dict, List

class FlexibleConversationExtractor:
    def __init__(self, participant_patterns: Dict[str, List[str]]):
        """
        Args:
            participant_patterns: Dict mapping roles to patterns/regex
            Example: {
                "user": ["**Laurent-Philippe**:", r"\*\*User\*\*:", "Human:"],
                "assistant": ["**Mnemosyne**:", "**Assistant**:", "AI:"]
            }
        """
        self.participant_patterns = participant_patterns

    def extract_conversations_from_file(self, file_path: str) -> List[Dict]:
        # Parse using configurable patterns (string matching or regex)
        # Convert to proper conversation format
        # Preserve domain knowledge
        # Support multiple participant identification methods
        ...

Key Improvement: Instead of hardcoding **Laurent-Philippe**, use configurable patterns that can be:

  • Simple string patterns: "**Laurent-Philippe**:"
  • Regex patterns: r"\*\*Laurent-Philippe.*?:"
  • Multiple alternatives per role: ["**User**:", "**Human**:", "User:"]
  • Flexible role mapping: {"user": [...], "assistant": [...], "system": [...]}
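
A minimal sketch of how such patterns could be matched against a speaker line (the match_role helper is hypothetical, written here only for illustration):

import re
from typing import Dict, List, Optional

def match_role(line: str, patterns: Dict[str, List[str]]) -> Optional[str]:
    """Hypothetical helper: return the role whose pattern matches the line start."""
    for role, alternatives in patterns.items():
        for pattern in alternatives:
            if line.startswith(pattern):       # plain string prefix match
                return role
            try:
                if re.match(pattern, line):    # regex match, anchored at line start
                    return role
            except re.error:
                pass  # pattern was a plain string that is not a valid regex
    return None

# match_role("**Mnemosyne**: Of course.", patterns)  -> "assistant"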

3. Model-Specific Chat Formatting (Leverage Existing System)

# Use existing ModelArchitectureManager for proper chat formatting
from forgellm.utils.model_architectures import get_model_architecture_manager

def format_conversations_for_training(conversations, base_model_name):
    """Format conversations using existing architecture system"""
    arch_manager = get_model_architecture_manager()
    
    formatted_examples = []
    for conversation in conversations:
        # Convert conversation to messages format
        messages = conversation["messages"]
        
        # Use existing architecture detection and formatting
        formatted_text = arch_manager.format_messages(messages, base_model_name)
        formatted_examples.append({"text": formatted_text})
    
    return formatted_examples

Key Advantage: We already have a comprehensive ModelArchitectureManager that:

  • Auto-detects architectures: Llama, Qwen, Gemma, Mistral, Phi, etc.
  • Handles chat formatting: Proper templates for each model family
  • Supports system messages: Including special handling for Gemma (system as assistant)
  • Is already integrated: Used in server and CLI for inference
  • Is well-tested: Proven to work with our model loading system
  • Supports 10+ architectures: Including edge cases and fallbacks

4. Conservative Post-CPT Hyperparameters

# Prevent catastrophic forgetting of CPT knowledge
config = {
    "learning_rate": 5e-6,     # Conservative LR
    "batch_size": 4,           # Stable training
    "max_iterations": 100,     # Quick adaptation
    "lora_layers": 16,         # Parameter efficiency
    "mnemosyne_ratio": 0.1     # Domain preservation
}

Implementation Roadmap

Phase 1: Fix Current IFT Implementation ⚡ High Priority

1.1 Rename and Fix Classes

# Fix import issues
class InstructionTuner:  # Rename from SOTAInstructTrainer
    def run_tuning(self):  # Add missing method
        return self.run_complete_pipeline()

1.2 Replace ConversationExtractor with Flexible System

# Replace XML-based extractor with configurable conversation parser
from typing import Any, Dict, List

class FlexibleConversationExtractor:
    """Extract conversations using configurable participant patterns"""

    def __init__(self, config: Dict[str, Any]):
        """
        Args:
            config: Configuration with participant patterns
            Example: {
                "participant_patterns": {
                    "user": ["**Laurent-Philippe**:", r"\*\*User\d*\*\*:"],
                    "assistant": ["**Mnemosyne**:", "**Assistant**:"]
                },
                "conversation_sources": ["verbatims/2025", "conversations"],
                "quality_filters": {...}
            }
        """
        self.config = config

    def extract_conversations(self, input_dirs: List[str]) -> List[Dict]:
        # Parse using configurable patterns
        # Support both string matching and regex
        # Convert to instruction format with proper role mapping
        # Apply quality filtering
        # Return structured conversations
        ...

Integration with ModelArchitectureManager:

# Use existing architecture system for formatting
from forgellm.utils.model_architectures import get_model_architecture_manager

def format_extracted_conversations(conversations, base_model_name):
    arch_manager = get_model_architecture_manager()
    return [
        {"text": arch_manager.format_messages(conv["messages"], base_model_name)}
        for conv in conversations
    ]

1.3 Integrate Hybrid Data Loading

# Add HuggingFace dataset integration
def load_general_instruction_data(self):
    # Load OASST1, OpenOrca, Dolly, Alpaca
    # Apply proper data mixture ratios
    # Format for model-specific chat templates
    ...
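
A hedged sketch of that loader using the HuggingFace datasets library (the repository IDs are the public ones for each dataset; each source has a different schema, so the per-source field mapping is omitted here):

from datasets import load_dataset

# Public HuggingFace dataset IDs for the four general-instruction sources
GENERAL_SOURCES = {
    "oasst1": "OpenAssistant/oasst1",
    "openorca": "Open-Orca/OpenOrca",
    "dolly": "databricks/databricks-dolly-15k",
    "alpaca": "tatsu-lab/alpaca",
}

def load_general_instruction_data(mixture, max_examples):
    """Sample each source according to its mixture ratio (schemas differ per
    dataset; a real implementation needs per-source field mapping)."""
    samples = {}
    for name, repo_id in GENERAL_SOURCES.items():
        n = int(max_examples * mixture[name])
        ds = load_dataset(repo_id, split="train")
        samples[name] = ds.shuffle(seed=42).select(range(min(n, len(ds))))
    return samples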

Phase 2: Web Interface Integration 🌐

2.1 Add IFT Training Option

  • Extend training interface to support IFT after CPT
  • Add IFT-specific configuration options
  • Show CPT → IFT pipeline progression

2.2 Model Type Detection

  • Auto-detect base model architecture (Llama, Gemma, Qwen)
  • Apply correct chat formatting
  • Validate model compatibility

2.3 Progress Monitoring

  • Extend dashboard to show IFT metrics
  • Track instruction-following capability
  • Monitor domain knowledge retention

Phase 3: Advanced Features 🚀

3.1 Quality Control

  • Conversation quality filtering
  • Response length validation
  • Domain alignment scoring
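
As a concrete starting point, a first-pass filter covering the first two checks might look like this (thresholds are placeholders, not decided values):

def passes_quality_filters(conversation, min_turns=2, min_response_chars=50,
                           max_response_chars=8000):
    """Hypothetical first-pass filter: drop trivial or truncated conversations."""
    messages = conversation["messages"]
    if len(messages) < min_turns:
        return False
    for msg in messages:
        if msg["role"] == "assistant":
            if not (min_response_chars <= len(msg["content"]) <= max_response_chars):
                return False
    return True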

3.2 Evaluation System

  • Instruction-following benchmarks
  • Domain knowledge retention tests
  • Conversational capability assessment

3.3 Pipeline Automation

  • Automatic CPT → IFT progression
  • Optimal checkpoint selection
  • Hyperparameter optimization
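
Optimal checkpoint selection, for example, could start as simply as taking the lowest validation loss (a sketch; the checkpoint metadata format is an assumption):

def select_best_checkpoint(checkpoints):
    """Pick the checkpoint with the lowest validation loss.

    `checkpoints` is assumed to be a list of dicts like
    {"path": "...", "val_loss": 2.31}; the real metrics format may differ.
    """
    return min(checkpoints, key=lambda c: c["val_loss"])["path"]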

Technical Integration Plan

1. Configuration Enhancement

# Extend existing config system with flexible conversation extraction
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class InstructTuningConfig:
    # Inherit from base config
    base_model_path: str      # CPT checkpoint path
    base_model_name: str      # Original base model

    # IFT-specific parameters
    custom_data_ratio: float = 0.1  # Renamed from mnemosyne_ratio
    max_train_examples: int = 10000
    conversation_sources: List[str] = field(
        default_factory=lambda: ["verbatims/2025", "conversations"]
    )

    # Flexible participant pattern configuration
    participant_patterns: Dict[str, List[str]] = field(default_factory=lambda: {
        "user": ["**Laurent-Philippe**:", "**User**:", "Human:"],
        "assistant": ["**Mnemosyne**:", "**Assistant**:", "AI:"],
        "system": ["**System**:", "System:"]
    })

    # Pattern matching options
    use_regex: bool = True       # Enable regex pattern matching
    case_sensitive: bool = False # Case-sensitive pattern matching

    # Conservative post-CPT settings
    learning_rate: float = 5e-6
    max_iterations: int = 100

    # Architecture integration (leverages existing system)
    auto_detect_architecture: bool = True  # Use ModelArchitectureManager
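
Instantiating it for a typical post-CPT run might look like this (the path and model name are illustrative):

config = InstructTuningConfig(
    base_model_path="models/cpt/mnemosyne_checkpoint_1000",  # illustrative CPT checkpoint
    base_model_name="Qwen/Qwen2.5-7B",                       # illustrative base model
    custom_data_ratio=0.1,
)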

2. Model Manager Integration (Leverages Existing Architecture System)

# Extend ModelManager for IFT models using existing architecture detection
from typing import Dict, List

from forgellm.utils.model_architectures import get_model_architecture_manager

class ModelManager:
    def load_ift_model(self, cpt_path: str, ift_adapter: str):
        # Load CPT base + IFT adapter
        # Auto-detect architecture using existing ModelArchitectureManager
        # Apply proper chat formatting based on detected architecture
        # Enable instruction mode with appropriate templates
        ...

    def format_for_instruction(self, messages: List[Dict], model_name: str):
        # Use existing ModelArchitectureManager.format_messages()
        # Handles all supported architectures automatically
        # Includes special cases like Gemma system message handling
        arch_manager = get_model_architecture_manager()
        return arch_manager.format_messages(messages, model_name)

3. Training Pipeline

# Complete CPT → IFT pipeline (configs come from the existing config system)
def run_complete_training_pipeline(cpt_config, ift_config):
    # 1. Run CPT training
    cpt_trainer = ContinuedPretrainer(cpt_config)
    cpt_checkpoint = cpt_trainer.run_training()

    # 2. Run IFT training on best CPT checkpoint
    ift_config.base_model_path = cpt_checkpoint
    ift_trainer = InstructionTuner(ift_config)
    ift_model = ift_trainer.run_tuning()

    return ift_model

Expected Outcomes

After IFT Implementation

  1. Complete Training Pipeline: CPT → IFT progression
  2. Conversational Models: Domain-adapted models that follow instructions
  3. Web Interface Support: Full IFT training through UI
  4. Model Publishing: Publish both CPT and IFT models
  5. Quality Assurance: Proper evaluation and validation

📊 Success Metrics

  • Instruction Following: Models respond appropriately to user prompts
  • Domain Retention: Maintain Mnemosyne knowledge after IFT
  • Conversation Quality: Natural, helpful responses
  • Pipeline Reliability: Consistent CPT → IFT progression

Conclusion

Current State: ForgeLLM has excellent CPT capabilities but lacks the critical IFT stage needed for conversational AI.

Next Steps: Implement SOTA instruction fine-tuning by adapting the proven external implementation to our codebase architecture.

Priority: ⚡ High - IFT is essential for creating usable conversational models from our CPT outputs.

Timeline: Phase 1 (Fix IFT) should be completed before adding new CPT features, as IFT is the natural next step in the training pipeline.

Key Architectural Advantages

Leveraging Existing Systems

  1. ModelArchitectureManager Integration:

    • No need to reimplement chat formatting
    • Automatic architecture detection
    • Handles 10+ model families with proven templates
    • Special case handling (Gemma system messages, etc.)
  2. Flexible Conversation Extraction:

    • Configurable participant patterns (no hardcoded names)
    • Regex support for complex patterns
    • Multiple pattern alternatives per role
    • Reusable across different datasets and use cases
  3. Proven Infrastructure:

    • Configuration system already in place
    • Model management already handles adapters
    • Web interface foundation exists
    • Dashboard and monitoring systems ready

🎯 Implementation Strategy

Phase 1 Priority: Fix the broken IFT system by:

  1. Replacing XML extractor with flexible pattern-based system
  2. Integrating ModelArchitectureManager for chat formatting
  3. Fixing import/class naming issues in CLI
  4. Leveraging existing config/model management systems

This approach maximizes code reuse and builds on proven, tested components rather than reimplementing functionality that already exists.


This document reflects the current state of training capabilities and provides a clear roadmap for implementing the missing instruction fine-tuning functionality that will complete our training pipeline.