Development Perspectives: Training Pipeline Evolution

Current State Analysis

✅ What We Have: Continued Pre-Training (CPT) Only

Currently, ForgeLLM implements only Continued Pre-Training (CPT), which is the first stage of domain adaptation:

CPT Capabilities:

  • Purpose: Adapt base models to domain-specific knowledge (Mnemosyne dataset)
  • Training Type: Full fine-tuning or LoRA on raw text data
  • Data Format: Plain markdown documents processed into chunks
  • Implementation: forgellm/training/trainer.py with ContinuedPretrainer class
  • Status: ✅ Fully Functional - Successfully trains models on custom datasets
  • Output: Domain-adapted models that understand Mnemosyne concepts but lack instruction-following capabilities
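
For reference, invoking CPT today looks roughly like the following (a minimal sketch; the class name, module path, and run_training() method match the pipeline code later in this document, while the config construction is simplified):

# Minimal sketch of the existing CPT entry point (config handling simplified)
from forgellm.training.trainer import ContinuedPretrainer

cpt_trainer = ContinuedPretrainer(cpt_config)    # cpt_config: standard training config
cpt_checkpoint = cpt_trainer.run_training()      # produces a domain-adapted checkpoint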

❌ What We're Missing: Instruction Fine-Tuning (IFT/SFT)

The second critical stage - Instruction Fine-Tuning - is currently missing:

IFT Requirements:

  • Purpose: Transform CPT models into conversational assistants that follow instructions
  • Training Type: Fine-tuning on conversation pairs (user prompts + assistant responses)
  • Data Format: Structured conversations with proper chat templates
  • Implementation: ⚠️ Incomplete - forgellm/training/instruction_tuner.py exists but has critical issues
  • Status: ❌ Non-Functional - Broken imports, wrong data format expectations, unused code

Critical Issues with Current IFT Implementation

1. Broken Architecture

# CLI imports non-existent class
from .training.instruction_tuner import InstructionTuner  # ❌ Does not exist

# CLI calls non-existent method
tuner.run_tuning()  # ❌ Should be run_complete_pipeline()

2. Wrong Data Format Expectations

The current ConversationExtractor expects an XML format that does not exist anywhere in our dataset:

<!-- Expected but NOT in our dataset -->
<conversation>
    <message from="user">Question</message>
    <message from="assistant">Answer</message>
</conversation>

Reality: Our dataset contains plain markdown with no XML conversation structure; speakers are marked only by bold name prefixes such as **Laurent-Philippe**:.

3. Incomplete Integration

  • No integration with existing model management
  • No web interface support
  • No proper configuration system
  • No testing or validation

Solution: Implement SOTA Instruction Fine-Tuning

Existing Assets We Can Leverage

🎯 ModelArchitectureManager System: We already have a sophisticated architecture detection and formatting system:

  • Location: forgellm/utils/model_architectures.py + model_architectures.json
  • Capabilities: Auto-detects 10+ model families (Llama, Qwen, Gemma, Mistral, Phi, etc.)
  • Chat Formatting: Proper templates for each architecture with special case handling
  • Integration: Already used in server and CLI for inference
  • Quality: Well-tested and handles edge cases (e.g., Gemma system message workarounds)

Key Methods Available:

  • detect_architecture(model_name) - Auto-detect model family
  • format_messages(messages, model_name) - Apply proper chat templates
  • is_instruct_model(model_name) - Determine if model expects instruction format
  • get_architecture_info(arch) - Get detailed architecture information
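
Putting these together, a minimal usage sketch (the model name is illustrative; the method names come from the list above):

from forgellm.utils.model_architectures import get_model_architecture_manager

arch_manager = get_model_architecture_manager()
messages = [
    {"role": "system", "content": "You are Mnemosyne."},
    {"role": "user", "content": "Explain continued pre-training."},
]

model_name = "Qwen/Qwen2.5-7B-Instruct"                # illustrative model name
arch = arch_manager.detect_architecture(model_name)    # e.g. "qwen"
if arch_manager.is_instruct_model(model_name):
    prompt = arch_manager.format_messages(messages, model_name)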

Reference Implementation Analysis

The external script instruct_tuning.py provides a complete, research-based IFT implementation that we should adapt:

Key Features to Integrate:

1. Hybrid Data Strategy (SOTA 2024-2025)

# Data mixture that prevents catastrophic forgetting
data_mixture = {
    "oasst1": 0.315,      # 31.5% - general instruction following
    "openorca": 0.27,     # 27%   - question answering
    "dolly": 0.18,        # 18%   - task completion
    "alpaca": 0.135,      # 13.5% - instruction following
    "mnemosyne": 0.10,    # 10%   - domain preservation
}
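
These ratios translate directly into per-source example counts. Continuing from the data_mixture above, with the max_train_examples = 10000 default used in the config further below:

# Sanity check: the mixture sums to 1.0 and maps to concrete example counts
assert abs(sum(data_mixture.values()) - 1.0) < 1e-9

counts = {name: int(10_000 * ratio) for name, ratio in data_mixture.items()}
# -> {'oasst1': 3150, 'openorca': 2700, 'dolly': 1800, 'alpaca': 1350, 'mnemosyne': 1000}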

2. Flexible Conversation Extraction

# Generic conversation extractor with configurable participant patterns
from typing import Dict, List

class FlexibleConversationExtractor:
    def __init__(self, participant_patterns: Dict[str, List[str]]):
        """
        Args:
            participant_patterns: Dict mapping roles to patterns/regex
            Example: {
                "user": ["**Laurent-Philippe**:", r"\*\*User\*\*:", "Human:"],
                "assistant": ["**Mnemosyne**:", "**Assistant**:", "AI:"]
            }
        """
        self.participant_patterns = participant_patterns

    def extract_conversations_from_file(self, file_path: str) -> List[Dict]:
        # Parse using configurable patterns (string matching or regex)
        # Convert to proper conversation format
        # Preserve domain knowledge
        # Support multiple participant identification methods
        ...

Key Improvement: Instead of hardcoding **Laurent-Philippe**, use configurable patterns that can be:

  • Simple string patterns: "**Laurent-Philippe**:"
  • Regex patterns: r"\*\*Laurent-Philippe.*?:"
  • Multiple alternatives per role: ["**User**:", "**Human**:", "User:"]
  • Flexible role mapping: {"user": [...], "assistant": [...], "system": [...]}
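
A minimal sketch of how such patterns could be matched against a speaker line (the match_role helper is hypothetical, written here only for illustration):

import re
from typing import Dict, List, Optional

def match_role(line: str, patterns: Dict[str, List[str]]) -> Optional[str]:
    """Hypothetical helper: return the role whose pattern matches the line start."""
    for role, alternatives in patterns.items():
        for pattern in alternatives:
            if line.startswith(pattern):       # plain string prefix match
                return role
            try:
                if re.match(pattern, line):    # regex match, anchored at line start
                    return role
            except re.error:
                pass  # pattern was a plain string that is not a valid regex
    return None

# match_role("**Mnemosyne**: Of course.", patterns)  -> "assistant"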

3. Model-Specific Chat Formatting (Leverage Existing System)

# Use existing ModelArchitectureManager for proper chat formatting
from forgellm.utils.model_architectures import get_model_architecture_manager

def format_conversations_for_training(conversations, base_model_name):
    """Format conversations using existing architecture system"""
    arch_manager = get_model_architecture_manager()
    
    formatted_examples = []
    for conversation in conversations:
        # Convert conversation to messages format
        messages = conversation["messages"]
        
        # Use existing architecture detection and formatting
        formatted_text = arch_manager.format_messages(messages, base_model_name)
        formatted_examples.append({"text": formatted_text})
    
    return formatted_examples

Key Advantage: We already have a comprehensive ModelArchitectureManager that:

  • Auto-detects architectures: Llama, Qwen, Gemma, Mistral, Phi, etc.
  • Handles chat formatting: Proper templates for each model family
  • Supports system messages: Including special handling for Gemma (system as assistant)
  • Is already integrated: Used in server and CLI for inference
  • Is well-tested: Proven to work with our model loading system
  • Supports 10+ architectures: Including edge cases and fallbacks

4. Conservative Post-CPT Hyperparameters

# Prevent catastrophic forgetting of CPT knowledge
config = {
    "learning_rate": 5e-6,     # Conservative LR
    "batch_size": 4,           # Stable training
    "max_iterations": 100,     # Quick adaptation
    "lora_layers": 16,         # Parameter efficiency
    "mnemosyne_ratio": 0.1     # Domain preservation
}

Implementation Roadmap

Phase 1: Fix Current IFT Implementation ⚡ High Priority

1.1 Rename and Fix Classes

# Fix import issues
class InstructionTuner:  # Rename from SOTAInstructTrainer
    def run_tuning(self):  # Add missing method
        return self.run_complete_pipeline()

1.2 Replace ConversationExtractor with Flexible System

# Replace XML-based extractor with configurable conversation parser
from typing import Any, Dict, List

class FlexibleConversationExtractor:
    """Extract conversations using configurable participant patterns"""

    def __init__(self, config: Dict[str, Any]):
        """
        Args:
            config: Configuration with participant patterns
            Example: {
                "participant_patterns": {
                    "user": ["**Laurent-Philippe**:", r"\*\*User\d*\*\*:"],
                    "assistant": ["**Mnemosyne**:", "**Assistant**:"]
                },
                "conversation_sources": ["verbatims/2025", "conversations"],
                "quality_filters": {...}
            }
        """
        self.config = config

    def extract_conversations(self, input_dirs: List[str]) -> List[Dict]:
        # Parse using configurable patterns
        # Support both string matching and regex
        # Convert to instruction format with proper role mapping
        # Apply quality filtering
        # Return structured conversations
        ...

Integration with ModelArchitectureManager:

# Use existing architecture system for formatting
from forgellm.utils.model_architectures import get_model_architecture_manager

def format_extracted_conversations(conversations, base_model_name):
    arch_manager = get_model_architecture_manager()
    return [
        {"text": arch_manager.format_messages(conv["messages"], base_model_name)}
        for conv in conversations
    ]

1.3 Integrate Hybrid Data Loading

# Add HuggingFace dataset integration
def load_general_instruction_data(self):
    # Load OASST1, OpenOrca, Dolly, Alpaca
    # Apply proper data mixture ratios
    # Format for model-specific chat templates
    ...
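
A hedged sketch of that loader using the HuggingFace datasets library (the repository IDs are the public ones for each dataset; each source has a different schema, so the per-source field mapping is omitted here):

from datasets import load_dataset

# Public HuggingFace dataset IDs for the four general-instruction sources
GENERAL_SOURCES = {
    "oasst1": "OpenAssistant/oasst1",
    "openorca": "Open-Orca/OpenOrca",
    "dolly": "databricks/databricks-dolly-15k",
    "alpaca": "tatsu-lab/alpaca",
}

def load_general_instruction_data(mixture, max_examples):
    """Sample each source according to its mixture ratio (schemas differ per
    dataset; a real implementation needs per-source field mapping)."""
    samples = {}
    for name, repo_id in GENERAL_SOURCES.items():
        n = int(max_examples * mixture[name])
        ds = load_dataset(repo_id, split="train")
        samples[name] = ds.shuffle(seed=42).select(range(min(n, len(ds))))
    return samples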

Phase 2: Web Interface Integration 🌐

2.1 Add IFT Training Option

  • Extend training interface to support IFT after CPT
  • Add IFT-specific configuration options
  • Show CPT → IFT pipeline progression

2.2 Model Type Detection

  • Auto-detect base model architecture (Llama, Gemma, Qwen)
  • Apply correct chat formatting
  • Validate model compatibility

2.3 Progress Monitoring

  • Extend dashboard to show IFT metrics
  • Track instruction-following capability
  • Monitor domain knowledge retention

Phase 3: Advanced Features 🚀

3.1 Quality Control

  • Conversation quality filtering
  • Response length validation
  • Domain alignment scoring
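
As a concrete starting point, a first-pass filter covering the first two checks might look like this (thresholds are placeholders, not decided values):

def passes_quality_filters(conversation, min_turns=2, min_response_chars=50,
                           max_response_chars=8000):
    """Hypothetical first-pass filter: drop trivial or truncated conversations."""
    messages = conversation["messages"]
    if len(messages) < min_turns:
        return False
    for msg in messages:
        if msg["role"] == "assistant":
            if not (min_response_chars <= len(msg["content"]) <= max_response_chars):
                return False
    return True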

3.2 Evaluation System

  • Instruction-following benchmarks
  • Domain knowledge retention tests
  • Conversational capability assessment

3.3 Pipeline Automation

  • Automatic CPT → IFT progression
  • Optimal checkpoint selection
  • Hyperparameter optimization
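
Optimal checkpoint selection, for example, could start as simply as taking the lowest validation loss (a sketch; the checkpoint metadata format is an assumption):

def select_best_checkpoint(checkpoints):
    """Pick the checkpoint with the lowest validation loss.

    `checkpoints` is assumed to be a list of dicts like
    {"path": "...", "val_loss": 2.31}; the real metrics format may differ.
    """
    return min(checkpoints, key=lambda c: c["val_loss"])["path"]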

Technical Integration Plan

1. Configuration Enhancement

# Extend existing config system with flexible conversation extraction
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class InstructTuningConfig:
    # Inherit from base config
    base_model_path: str      # CPT checkpoint path
    base_model_name: str      # Original base model

    # IFT-specific parameters
    custom_data_ratio: float = 0.1  # Renamed from mnemosyne_ratio
    max_train_examples: int = 10000
    conversation_sources: List[str] = field(
        default_factory=lambda: ["verbatims/2025", "conversations"]
    )

    # Flexible participant pattern configuration
    participant_patterns: Dict[str, List[str]] = field(default_factory=lambda: {
        "user": ["**Laurent-Philippe**:", "**User**:", "Human:"],
        "assistant": ["**Mnemosyne**:", "**Assistant**:", "AI:"],
        "system": ["**System**:", "System:"]
    })

    # Pattern matching options
    use_regex: bool = True       # Enable regex pattern matching
    case_sensitive: bool = False # Case-sensitive pattern matching

    # Conservative post-CPT settings
    learning_rate: float = 5e-6
    max_iterations: int = 100

    # Architecture integration (leverages existing system)
    auto_detect_architecture: bool = True  # Use ModelArchitectureManager
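
Instantiating it for a typical post-CPT run might look like this (the path and model name are illustrative):

config = InstructTuningConfig(
    base_model_path="models/cpt/mnemosyne_checkpoint_1000",  # illustrative CPT checkpoint
    base_model_name="Qwen/Qwen2.5-7B",                       # illustrative base model
    custom_data_ratio=0.1,
)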

2. Model Manager Integration (Leverages Existing Architecture System)

# Extend ModelManager for IFT models using existing architecture detection
from typing import Dict, List

from forgellm.utils.model_architectures import get_model_architecture_manager

class ModelManager:
    def load_ift_model(self, cpt_path: str, ift_adapter: str):
        # Load CPT base + IFT adapter
        # Auto-detect architecture using existing ModelArchitectureManager
        # Apply proper chat formatting based on detected architecture
        # Enable instruction mode with appropriate templates
        ...

    def format_for_instruction(self, messages: List[Dict], model_name: str):
        # Use existing ModelArchitectureManager.format_messages()
        # Handles all supported architectures automatically
        # Includes special cases like Gemma system message handling
        arch_manager = get_model_architecture_manager()
        return arch_manager.format_messages(messages, model_name)

3. Training Pipeline

# Complete CPT → IFT pipeline (configs come from the existing config system)
def run_complete_training_pipeline(cpt_config, ift_config):
    # 1. Run CPT training
    cpt_trainer = ContinuedPretrainer(cpt_config)
    cpt_checkpoint = cpt_trainer.run_training()

    # 2. Run IFT training on best CPT checkpoint
    ift_config.base_model_path = cpt_checkpoint
    ift_trainer = InstructionTuner(ift_config)
    ift_model = ift_trainer.run_tuning()

    return ift_model

Expected Outcomes

After IFT Implementation

  1. Complete Training Pipeline: CPT → IFT progression
  2. Conversational Models: Domain-adapted models that follow instructions
  3. Web Interface Support: Full IFT training through UI
  4. Model Publishing: Publish both CPT and IFT models
  5. Quality Assurance: Proper evaluation and validation

📊 Success Metrics

  • Instruction Following: Models respond appropriately to user prompts
  • Domain Retention: Maintain Mnemosyne knowledge after IFT
  • Conversation Quality: Natural, helpful responses
  • Pipeline Reliability: Consistent CPT → IFT progression

Conclusion

Current State: ForgeLLM has excellent CPT capabilities but lacks the critical IFT stage needed for conversational AI.

Next Steps: Implement SOTA instruction fine-tuning by adapting the proven external implementation to our codebase architecture.

Priority: ⚡ High - IFT is essential for creating usable conversational models from our CPT outputs.

Timeline: Phase 1 (Fix IFT) should be completed before adding new CPT features, as IFT is the natural next step in the training pipeline.

Key Architectural Advantages

Leveraging Existing Systems

  1. ModelArchitectureManager Integration:

    • No need to reimplement chat formatting
    • Automatic architecture detection
    • Handles 10+ model families with proven templates
    • Special case handling (Gemma system messages, etc.)
  2. Flexible Conversation Extraction:

    • Configurable participant patterns (no hardcoded names)
    • Regex support for complex patterns
    • Multiple pattern alternatives per role
    • Reusable across different datasets and use cases
  3. Proven Infrastructure:

    • Configuration system already in place
    • Model management already handles adapters
    • Web interface foundation exists
    • Dashboard and monitoring systems ready

🎯 Implementation Strategy

Phase 1 Priority: Fix the broken IFT system by:

  1. Replacing XML extractor with flexible pattern-based system
  2. Integrating ModelArchitectureManager for chat formatting
  3. Fixing import/class naming issues in CLI
  4. Leveraging existing config/model management systems

This approach maximizes code reuse and builds on proven, tested components rather than reimplementing functionality that already exists.


This document reflects the current state of training capabilities and provides a clear roadmap for implementing the missing instruction fine-tuning functionality that will complete our training pipeline.