Continued Pre-training (CPT) is the process of further training a pre-trained language model on additional data to enhance its capabilities in specific domains or tasks. This document outlines current best practices and recent research findings.
Historically, CPT was performed exclusively on base models (non-instruction-tuned) under the assumption that instruction-tuning capabilities would be lost during continued pre-training.
Advantages of Base Models for CPT:
- ✅ Clean foundation without instruction-following constraints
- ✅ Higher learning rates and longer training possible
- ✅ No risk of degrading instruction-following capabilities
- ✅ Simpler training process
Recent research has challenged the traditional approach, revealing that instruction-tuned models can actually be superior for continued pre-training.
"Instruction-tuned Language Models are Better Knowledge Learners" (ACL 2024)
- Authors: Zhengbao Jiang, Zhiqing Sun, Weijia Shi, et al.
- arXiv: 2402.12847
- Key Finding: Instruction-tuned models outperform base models in knowledge absorption by 17.8%
Why Instruction-First Helps:
- Enhanced Learning Capabilities: Instruction-tuning creates better internal representations for knowledge absorption
- Complex Document Processing: Better at extracting knowledge from intricate, multi-faceted documents
- Question-Answer Alignment: Pre-trained to understand how knowledge is accessed through questions
- Improved Generalization: Better transfer of learned knowledge to new tasks
NVIDIA UltraLong-8B (April 2025)
- Paper: "From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models"
- arXiv: 2504.06214
- Achievement: Extended Llama3.1-Instruct from 128K to 4M tokens
- Method: Efficient continued pre-training + instruction tuning
Key Techniques:
- YaRN-based scaling for position embeddings
- Careful data mixing with high-quality SFT datasets
- Maintained instruction-following while extending context
Concept: Pre-instruction-tuning (PIT), i.e., instruction-tune BEFORE continued pre-training on documents
Benefits:
- 17.8% improvement in knowledge absorption
- Better encoding of knowledge from complex documents
- Improved question-answering capabilities
Process:
- Start with base model
- Instruction-tune on QA pairs
- Continue pre-training on domain documents
- Optional: Final instruction-tuning refinement
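The four steps above can be sketched as an ordered pipeline. This is a minimal sketch: the phase names and the `run_instruct_first_cpt` helper are illustrative placeholders, not an established API; a real run would invoke an actual training loop at each phase.

```python
def run_instruct_first_cpt(refine=True):
    """Return the training phases of the instruct-first recipe, in order.

    Hypothetical phase labels; in practice each entry would be a full
    training job (e.g. SFT, then continued pre-training).
    """
    phases = ["base_model"]                       # step 1: start from a base model
    phases.append("instruction_tune_on_qa")       # step 2: SFT on QA pairs
    phases.append("continued_pretrain_on_docs")   # step 3: CPT on domain documents
    if refine:
        phases.append("final_instruction_refinement")  # optional step 4
    return phases

print(run_instruct_first_cpt())
```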
"Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions" (April 2025)
- arXiv: 2504.05571
- Method: Pure instruction-tuning with synthetic data for knowledge injection
- Advantage: Minimizes catastrophic forgetting while injecting new knowledge
Recommended Mix:
- 90-95%: New domain-specific data
- 5-10%: Original pre-training data (prevents catastrophic forgetting)
- Optional: 1-5% instruction data to maintain capabilities
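Turning those ratios into concrete sample counts can be sketched as below. The 90/7/3 defaults are one point inside the recommended ranges, not a prescribed split, and `build_mix` is a made-up helper name.

```python
def build_mix(n_total, domain_frac=0.90, replay_frac=0.07, instruct_frac=0.03):
    """Split a training budget into domain / replay / instruction samples."""
    assert abs(domain_frac + replay_frac + instruct_frac - 1.0) < 1e-9
    n_domain = round(n_total * domain_frac)
    n_replay = round(n_total * replay_frac)
    n_instruct = n_total - n_domain - n_replay  # remainder avoids rounding drift
    return {"domain": n_domain, "replay": n_replay, "instruct": n_instruct}

print(build_mix(100_000))  # → {'domain': 90000, 'replay': 7000, 'instruct': 3000}
```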
Learning Rates:
- Base Models: 1e-4 to 5e-4
- Instruct Models: 2e-5 to 1e-4 (roughly 5x lower than base models)
Training Duration:
- Base Models: Can train longer (500-1000+ steps)
- Instruct Models: Shorter training (100-500 steps) to preserve instruction-following
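One way to wire these ranges into a schedule is linear warmup followed by cosine decay. The peak values below are illustrative picks from the ranges above, and the 10% warmup fraction is an assumption, not a recommendation from the cited papers.

```python
import math

# Illustrative peak LRs drawn from the ranges above.
PEAK_LR = {"base": 3e-4, "instruct": 5e-5}

def cpt_lr(step, total_steps, model_type, warmup=0.1):
    """Linear warmup to the peak LR, then cosine decay to zero."""
    peak = PEAK_LR[model_type]
    warmup_steps = max(1, int(total_steps * warmup))
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))
```

The same schedule shape works for both model types; only the peak (and the total step budget) changes.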
Monitoring:
- Track domain-specific performance (primary metric)
- Monitor instruction-following capabilities (secondary metric)
- Use validation sets from both domains
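A sketch of that dual-track monitoring: `check_training_health` and the 5-point drop tolerance are hypothetical choices for illustration, not a standard threshold.

```python
def check_training_health(domain_acc, instruct_acc, instruct_baseline,
                          max_drop=0.05):
    """Flag runs where instruction-following degrades past a tolerance.

    domain_acc / instruct_acc: current validation accuracies;
    instruct_baseline: the instruction-following metric before CPT started.
    max_drop: illustrative tolerance, tune per project.
    """
    drop = instruct_baseline - instruct_acc
    return {
        "domain_acc": domain_acc,
        "instruct_drop": drop,
        "forgetting_alert": drop > max_drop,
    }
```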
Base-model pipeline: Raw text documents → Tokenization → Training chunks
Instruct-model pipeline: Raw text documents → Question-Answer pair generation → Instruction format → Training
Synthetic Data Generation:
- Use smaller LMs to generate instruction data from documents
- Create diverse question types (factual, reasoning, summarization)
- Ensure proper instruction formatting
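The instruction-formatting step can be sketched as below. The templates and the `to_instruction_example` helper are made-up examples; a real project should use the target model's own chat template instead of ad-hoc strings.

```python
# Hypothetical prompt templates, one per question type.
TEMPLATES = {
    "factual": "Answer the question using the document.\n\n{doc}\n\nQ: {q}",
    "summarization": "Summarize the following document.\n\n{doc}",
}

def to_instruction_example(doc, question, answer, kind="factual"):
    """Wrap a generated QA pair in an instruction-style prompt/completion."""
    prompt = TEMPLATES[kind].format(doc=doc, q=question)
    return {"prompt": prompt, "completion": answer}

ex = to_instruction_example(
    "CPT adapts a model to a domain.",
    "What does CPT do?",
    "It adapts a pre-trained model to a new domain.")
```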
Forgetting Mitigation Strategies:
- Replay Method: Mix original training data (5-10%)
- Elastic Weight Consolidation (EWC): Protect important parameters
- Progressive Training: Gradually introduce new data
- Regular Evaluation: Monitor performance on original tasks
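The Replay Method above can be sketched as a simple blend. `mix_with_replay` is illustrative; real pipelines would stream and deduplicate rather than materialize full lists in memory.

```python
import random

def mix_with_replay(new_data, original_data, replay_frac=0.07, seed=0):
    """Blend a small slice of original pre-training data into the CPT
    stream so that replay_frac of the final mix is replay data."""
    rng = random.Random(seed)
    # n_replay / (len(new_data) + n_replay) == replay_frac
    n_replay = round(len(new_data) * replay_frac / (1 - replay_frac))
    replay = rng.sample(original_data, min(n_replay, len(original_data)))
    mixed = list(new_data) + replay
    rng.shuffle(mixed)
    return mixed
```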
Warning Signs of Forgetting:
- Degraded performance on general tasks
- Loss of instruction-following capabilities
- Reduced coherence in responses
Choose a Base Model When:
- Starting from scratch with domain adaptation
- Planning extensive training (1000+ steps)
- Domain data is very different from original training
- Maximum learning rate flexibility needed
Choose an Instruct Model When:
- Working with complex, multi-faceted documents
- Limited training data available
- Need to maintain instruction-following capabilities
- Documents contain question-answerable knowledge
Domain Metrics:
- Knowledge retention tests
- Domain-specific benchmarks
- Task-specific performance
General Capability Metrics:
- Instruction-following accuracy
- General reasoning tasks
- Conversational quality
Long-Term Metrics:
- Performance stability over time
- Knowledge retention vs. new learning balance
- Generalization to unseen tasks
Common Pitfalls:
- Over-training: Destroys original capabilities
- No data mixing: Leads to catastrophic forgetting
- Wrong learning rates: Too high for instruct models
- Ignoring evaluation: Not monitoring instruction capabilities
- Poor data quality: Low-quality domain data hurts performance
Best Practices:
- Careful data curation: High-quality, relevant domain data
- Balanced training: Mix old and new data appropriately
- Regular monitoring: Track multiple performance dimensions
- Iterative approach: Start small, scale gradually
- Proper evaluation: Test both domain and general capabilities
Future Directions:
- Mixture of Experts (MoE): Specialized experts for different domains
- Parameter-Efficient Methods: LoRA, adapters for domain-specific layers
- Dynamic Data Mixing: Adaptive ratios based on performance
- Multi-Modal CPT: Extending to vision and audio domains
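As a concrete illustration of the parameter-efficient route, here is a toy LoRA weight update in pure Python: W' = W + (alpha/r) * B @ A, with A and B low-rank (r = 1 here). The tiny matrices are for clarity only; real implementations (e.g. the `peft` library) operate on tensors, keep W frozen, and train only A and B.

```python
def matmul(X, Y):
    """Plain list-of-lists matrix multiply (for the toy example only)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_weight(W, A, B, alpha=2.0, r=1):
    """Merged weight W + (alpha/r) * B @ A; W stays frozen during training."""
    delta = matmul(B, A)  # (out, r) @ (r, in) → (out, in)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight
A = [[0.5, 0.5]]              # (r=1, in=2)
B = [[1.0], [0.0]]            # (out=2, r=1)
print(lora_weight(W, A, B))   # → [[2.0, 1.0], [0.0, 1.0]]
```

Because only A and B (2r·d parameters instead of d²) are trained, the base model's knowledge is largely preserved, which is exactly why these methods appeal for domain-specific CPT.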
Open Research Questions:
- Optimal data mixing ratios for different domains
- Better synthetic data generation methods
- Long-term stability of CPT models
- Cross-domain knowledge transfer
The landscape of continued pre-training has evolved significantly. While base models remain excellent for CPT, instruction-tuned models have emerged as potentially superior knowledge learners when proper techniques are applied. The key is understanding the trade-offs and applying the right approach for your specific use case.
Key Takeaway: Don't automatically dismiss instruct models for CPT. With careful data mixing, appropriate learning rates, and proper monitoring, they can achieve superior knowledge absorption while maintaining their instruction-following capabilities.
References:
- Jiang, Z., et al. (2024). "Instruction-tuned Language Models are Better Knowledge Learners." ACL 2024. arXiv:2402.12847
- Xu, C., et al. (2025). "From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models." arXiv:2504.06214
- Ovadia, O., et al. (2025). "Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions." arXiv:2504.05571
- Liu, X., et al. (2025). "Thus Spake Long-Context Large Language Model." arXiv:2502.17129
- Li, J., et al. (2025). "WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale." arXiv:2502.16684