Target: 50,000 high-quality multi-step sequences
{
"id": "sha256_hash",
"version": "1.0.0",
"created_at": "2024-12-15T10:30:00Z",
"metadata": {
"conversation_id": "original_conv_id",
"tool": "Claude Code|Cline|Roo-Cline",
"project": "/path/to/project",
"primary_language": "rust",
"all_languages": ["rust", "typescript"],
"timestamp_start": "ISO8601",
"timestamp_end": "ISO8601",
"duration_seconds": 1234
},
"task": {
"user_request": "original user message",
"inferred_goal": "structured goal description",
"category": "debug|feature|refactor|setup|test",
"complexity": "simple|moderate|complex|expert"
},
"initial_state": {
"project_files": ["src/main.rs", "Cargo.toml"],
"project_description": "extracted from context",
"existing_errors": ["list of known issues"],
"git_state": "clean|dirty|branch_info"
},
"trajectory": [
{
"step": 0,
"user_message": "text or null for AI-driven steps",
"ai_reasoning": "extracted thinking/planning",
"ai_response": "text response to user",
"tool_calls": [
{
"tool": "bash|read|write|edit|grep|glob|task",
"parameters": {
"command": "cargo build",
"file_path": "/path/to/file",
"pattern": "regex"
},
"result": {
"success": true,
"output": "tool output",
"exit_code": 0,
"error": null
},
"tokens_used": {"input": 100, "output": 50}
}
],
"state_changes": {
"files_modified": ["src/main.rs"],
"files_created": [],
"files_deleted": [],
"knowledge_gained": "discovered X uses Y pattern"
},
"step_outcome": "success|partial|failure|blocked"
}
],
"final_outcome": {
"status": "completed|abandoned|failed",
"success": true,
"final_state": "description of end state",
"artifacts_produced": ["list of outputs"],
"total_steps": 15,
"total_tool_calls": 47
},
"features": {
"planning_quality": {
"has_upfront_plan": true,
"plan_steps": 5,
"plan_followed": 0.8
},
"tool_usage": {
"unique_tools": ["bash", "edit", "read"],
"tool_call_count": 47,
"tool_diversity_score": 0.6,
"bash_commands": 12,
"file_operations": 23
},
"complexity_metrics": {
"files_touched": 8,
"total_tokens": 45000,
"context_switches": 3,
"max_context_length": 12000,
"multi_step_reasoning": true
},
"error_handling": {
"errors_encountered": 5,
"recovery_attempts": 5,
"recovery_success_rate": 0.8,
"retry_strategies": ["reread context", "alternative approach"]
},
"agent_capabilities": {
"reads_before_write": true,
"validates_changes": true,
"uses_context_effectively": 0.9,
"adapts_to_errors": true,
"parallel_tool_calls": false
}
},
"quality_score": {
"overall": 0.85,
"completeness": 0.9,
"coherence": 0.8,
"reusability": 0.85,
"educational_value": 0.7
},
"annotations": {
"key_decisions": ["decided to refactor instead of patch"],
"interesting_patterns": ["used grep to find all usages"],
"failure_points": ["initially tried wrong approach"],
"learning_moments": ["discovered project uses async pattern"]
}
}KEEP if ALL:
- ✅ Total tool calls ≥ 5
- ✅ At least 2 different tool types used
- ✅ Has clear user request
- ✅ Has measurable outcome (success or instructive failure)
- ✅ No data corruption (valid JSON, complete messages)
- ✅ At least one file operation (read/write/edit)
- ✅ Tokens < 500k (avoid extreme outliers)
REJECT if ANY:
- ❌ Single tool call only
- ❌ Trivial tasks (e.g., "what is X?")
- ❌ Data corruption or truncation
- ❌ No clear task goal
- ❌ Spam or test conversations
- ❌ Contains PII that can't be sanitized
TIERED QUALITY:
- Gold (20%): Multi-step, clear planning, successful, uses 4+ tools, >10 steps
- Silver (40%): Clear task, 2-3 tools, 5-10 steps, may have failures
- Bronze (40%): Simpler but valid examples, good for diversity
def extract_agentic_features(conversation):
features = {}
# 1. PLANNING DETECTION
features['has_plan'] = detect_planning_language(first_ai_message)
features['plan_quality'] = measure_plan_specificity(plan)
# 2. TOOL USE PATTERNS
features['tool_sequence'] = extract_tool_call_sequence()
features['tool_diversity'] = len(unique_tools) / total_tools
features['parallel_tool_use'] = detect_parallel_patterns()
# 3. CONTEXT MANAGEMENT
features['context_window_usage'] = track_context_references()
features['memory_efficiency'] = measure_redundant_reads()
# 4. ERROR RECOVERY
features['error_types'] = categorize_errors()
features['recovery_strategy'] = identify_recovery_patterns()
features['retry_intelligence'] = measure_adaptation()
# 5. TASK DECOMPOSITION
features['subtasks'] = identify_logical_subtasks()
features['dependency_graph'] = build_task_dependencies()
# 6. CODE UNDERSTANDING
features['reads_before_writes'] = check_read_write_order()
features['validates_changes'] = detect_verification_steps()
return featuresTotal: 50,000 examples
STRATIFIED SPLIT by:
- Complexity (simple/moderate/complex/expert): 25/35/30/10%
- Tool diversity (low/medium/high): 33/34/33%
- Outcome (success/failure): 75/25%
- Language (top 5 languages + other)
SPLIT RATIOS:
├─ Train: 35,000 (70%)
├─ Validation: 7,500 (15%)
└─ Test: 7,500 (15%)
SPLIT RULES:
1. Entire conversations stay together (no turn splitting)
2. Same project conversations grouped to avoid leakage
3. Temporal holdout: Last 10% of data by time → test set
4. Diversity enforcement: Ensure rare tool combinations in all splits
5. Difficulty distribution maintained across splits
def deduplicate_trajectories():
# Level 1: Exact duplicates
remove_identical_hashes()
# Level 2: Near-duplicates (edit distance)
remove_similar_trajectories(threshold=0.95)
# Level 3: Same project, same error, same fix
remove_repetitive_debugging_sessions()
# Level 4: Keep diverse examples
if same_goal and same_outcome:
keep_most_complex_trajectory()def validate_example(example):
checks = {
'has_valid_json': check_json_parseable(),
'has_user_request': example['task']['user_request'] is not None,
'has_trajectory': len(example['trajectory']) > 0,
'tools_are_valid': all(t in VALID_TOOLS for t in tools),
'has_outcome': example['final_outcome']['status'] in VALID_STATUSES,
'no_pii': not contains_pii(example),
'reasonable_length': 10 < len(trajectory) < 200,
'coherent': check_conversation_coherence(example)
}
return all(checks.values()), checksTarget: 80,000 error-fix pairs
{
"id": "sha256_hash",
"version": "1.0.0",
"created_at": "2024-12-15T10:30:00Z",
"metadata": {
"conversation_id": "parent_conv",
"language": "rust",
"framework": "tokio|react|django|null",
"project": "/path/to/project",
"timestamp": "ISO8601",
"tool": "Claude Code|Cline|..."
},
"error": {
"type": "compile_error|runtime_error|type_error|lint_error|test_failure",
"source": "compiler|interpreter|linter|test_framework|runtime",
"severity": "error|warning|info",
"raw_output": "full error message with formatting",
"parsed": {
"error_code": "E0308|TS2345|ImportError",
"message": "cleaned error message",
"file": "src/main.rs",
"line": 42,
"column": 15,
"suggestion": "compiler suggestion if any"
},
"stack_trace": "full stack trace if available",
"related_errors": ["other errors that appeared together"],
"category": {
"primary": "type_mismatch|undefined_variable|syntax|logic|...",
"secondary": ["borrow_checker", "lifetime"],
"difficulty": "trivial|easy|medium|hard|expert"
}
},
"context": {
"problematic_code": {
"file_path": "relative/path/to/file.rs",
"line_start": 40,
"line_end": 50,
"code": "actual code block",
"syntax_highlighted": "with language"
},
"surrounding_context": {
"before": "20 lines before",
"after": "20 lines after",
"full_file_available": true
},
"dependencies": {
"imports": ["use std::collections::HashMap"],
"related_functions": ["function definitions used"],
"type_definitions": ["relevant structs/types"]
},
"related_files": [
{
"path": "src/models.rs",
"relevance": "defines type used here",
"excerpt": "relevant snippet"
}
],
"project_context": {
"build_tool": "cargo|npm|pip|...",
"dependencies_file": "Cargo.toml content",
"language_version": "rust 1.70",
"target": "binary|library|..."
}
},
"debugging_process": {
"turns": [
{
"turn": 0,
"user": "error message or question",
"assistant": "analysis and suggestion",
"actions_taken": ["read file", "check types"],
"hypothesis": "I think the issue is X"
}
],
"total_turns": 3,
"time_to_solve_seconds": 120
},
"solution": {
"fix_type": "syntax_fix|type_annotation|refactor|logic_change|dependency_update",
"approach": "direct_fix|workaround|refactor|upgrade",
"code_changes": {
"before": "problematic code",
"after": "fixed code",
"diff_unified": "unified diff format",
"lines_added": 2,
"lines_removed": 1,
"lines_modified": 3
},
"explanation": {
"root_cause": "detailed explanation of WHY error occurred",
"fix_rationale": "WHY this fix works",
"learning": "what to remember for future",
"alternatives": ["other ways to fix this"]
},
"verification": {
"method": "compilation|tests|manual_check",
"compiled_successfully": true,
"tests_passed": true,
"user_confirmed_fix": true,
"follow_up_errors": []
}
},
"features": {
"error_characteristics": {
"is_common_error": true,
"beginner_mistake": false,
"language_specific": true,
"framework_specific": false
},
"fix_characteristics": {
"complexity": "simple|moderate|complex",
"requires_understanding": ["type_system", "borrow_checker"],
"requires_refactor": false,
"one_line_fix": false,
"breaking_change": false
},
"educational_value": {
"teaches_concept": ["ownership", "lifetimes"],
"common_pitfall": true,
"has_clear_pattern": true,
"reusable": 0.9
}
},
"quality_score": {
"overall": 0.88,
"error_clarity": 0.9,
"context_completeness": 0.85,
"solution_quality": 0.9,
"explanation_quality": 0.85,
"verification": 1.0
}
}KEEP if ALL:
- ✅ Has clear error message
- ✅ Has before/after code
- ✅ Code actually changed (before ≠ after)
- ✅ Has some form of verification
- ✅ Error and fix are related
- ✅ Code is syntactically valid after fix
- ✅ No PII in code/paths
REJECT if ANY:
- ❌ No error message
- ❌ Before == After (no actual fix)
- ❌ Incomplete/truncated
- ❌ Not a real error (just questions)
- ❌ Fix unrelated to error
- ❌ Contains credentials/keys/secrets
QUALITY TIERS:
- Gold (30%): Complete context, verified fix, great explanation, common error
- Silver (50%): Good context, likely correct fix, decent explanation
- Bronze (20%): Minimal but valid, for diversity
def extract_debugging_features(error_fix_pair):
features = {}
# 1. ERROR CATEGORIZATION
features['error_category'] = categorize_error_type(error.message)
features['error_patterns'] = extract_error_patterns(error.raw_output)
features['language_specific'] = detect_language_specific_error()
# 2. ERROR DIFFICULTY
features['difficulty'] = estimate_difficulty(
turns=debugging_process.turns,
code_complexity=analyze_code_complexity(),
concept_depth=detect_concepts_required()
)
# 3. FIX COMPLEXITY
features['fix_complexity'] = measure_fix_complexity(
lines_changed=solution.lines_added + lines_removed,
scope=detect_change_scope(), # local/file/project
breaking=is_breaking_change()
)
# 4. EDUCATIONAL VALUE
features['teaches'] = extract_concepts_taught(explanation)
features['common_pattern'] = is_common_error_pattern()
features['reusability'] = estimate_reusability()
# 5. CONTEXT QUALITY
features['context_completeness'] = (
has_full_file * 0.3 +
has_dependencies * 0.2 +
has_related_files * 0.2 +
has_stack_trace * 0.3
)
return featuresTotal: 80,000 examples
STRATIFIED by:
- Language (rust/ts/python/go/...): actual distribution
- Error type (compile/runtime/type/...): 40/30/20/10%
- Difficulty (trivial/easy/medium/hard/expert): 10/25/35/20/10%
- Fix complexity (simple/moderate/complex): 40/40/20%
SPLIT RATIOS:
├─ Train: 56,000 (70%)
├─ Validation: 12,000 (15%)
└─ Test: 12,000 (15%)
SPECIAL TEST SETS:
├─ Seen errors, unseen context (6,000)
├─ Unseen error patterns (3,000)
└─ Hardest examples only (3,000)
LEAKAGE PREVENTION:
- Same error message → same split
- Same project → same split
- Temporal: earlier errors in train, later in test
def deduplicate_errors():
# Level 1: Identical error + fix
remove_exact_duplicates()
# Level 2: Same error, similar fix (fuzzy match)
cluster_by_error_message()
keep_best_explanation_per_cluster()
# Level 3: Repeated patterns in same project
if same_project and same_error_type:
keep_first_occurrence_plus_unique_variations()
# Level 4: Language-specific common errors
ensure_diversity_within_error_categories()def validate_error_fix(example):
checks = {
'has_error': example['error']['raw_output'] is not None,
'has_code_before': example['context']['problematic_code'] is not None,
'has_code_after': example['solution']['code_changes']['after'] is not None,
'code_changed': before != after,
'has_explanation': example['solution']['explanation']['root_cause'] is not None,
'verified': example['solution']['verification']['compiled_successfully'],
'no_pii': not contains_pii(example),
'parseable': can_parse_code(before) and can_parse_code(after),
'language_match': detected_language == metadata_language,
'reasonable_diff': 1 <= lines_changed <= 1000
}
quality_score = compute_quality_score(example)
return all(checks.values()) and quality_score > 0.6, checksTarget: 1,500 ultra-long conversations
{
"id": "sha256_hash",
"version": "1.0.0",
"created_at": "2024-12-15T10:30:00Z",
"metadata": {
"conversation_id": "original_id",
"tool": "Claude Code",
"project": "/path/to/project",
"timestamp_start": "ISO8601",
"timestamp_end": "ISO8601",
"duration_hours": 19.3,
"session_type": "single_session|multi_session"
},
"statistics": {
"total_turns": 2487,
"total_tokens": 8629763,
"token_distribution": {
"min_turn": 50,
"max_turn": 15000,
"mean_turn": 3470,
"median_turn": 2100
},
"user_tokens": 2400000,
"assistant_tokens": 6229763,
"cumulative_context_used": [/* array of context sizes per turn */]
},
"messages": [
{
"turn": 0,
"timestamp": "ISO8601",
"role": "user|assistant",
"content": "message text",
"tool_calls": [/* if assistant */],
"tool_results": [/* results */],
"tokens": {
"input": 5000,
"output": 1200,
"cumulative": 6200
},
"context_window": {
"total_context": 50000,
"new_input": 1200,
"preserved_history": 48800,
"dropped_history": 0
},
"references": {
"references_turn": [5, 12, 45], /* which previous turns are referenced */
"references_files": ["src/main.rs"],
"references_concepts": ["authentication", "JWT"]
}
}
],
"conversation_structure": {
"main_goal": "build complete authentication system",
"phases": [
{
"phase": 0,
"description": "planning and design",
"turns": [0, 50],
"outcome": "completed"
},
{
"phase": 1,
"description": "database schema implementation",
"turns": [51, 200],
"outcome": "completed"
}
],
"sub_tasks": [
{
"task_id": "t1",
"description": "implement JWT token generation",
"turns": [100, 150],
"parent_task": null,
"child_tasks": ["t1.1", "t1.2"],
"completed": true
}
],
"context_switches": [
{
"turn": 200,
"from_topic": "frontend",
"to_topic": "backend",
"reason": "needed to fix API first"
}
],
"knowledge_evolution": [
{
"turn": 50,
"discovery": "project uses custom auth middleware",
"impact": "changed approach to use existing patterns"
}
]
},
"context_analysis": {
"files_involved": {
"total_unique": 45,
"by_turn": [/* files introduced at each turn */],
"most_referenced": ["src/auth.rs", "src/db.rs"]
},
"long_range_dependencies": [
{
"turn": 500,
"references_turn": 50,
"distance": 450,
"content": "referred back to initial design decision"
}
],
"information_density": [
/* per-turn score of new information vs repetition */
],
"context_efficiency": {
"unique_concepts_ratio": 0.75,
"repetition_ratio": 0.25,
"context_reuse_score": 0.8
}
},
"features": {
"conversation_characteristics": {
"is_multi_session": true,
"has_breaks": true,
"maintains_coherence": 0.9,
"topic_diversity": 0.7
},
"complexity_metrics": {
"project_scope": "large",
"files_modified": 45,
"languages_used": ["rust", "sql", "typescript"],
"frameworks": ["tokio", "diesel"],
"architectural_decisions": 12
},
"problem_solving": {
"obstacles_encountered": 23,
"pivots_made": 5,
"rollbacks": 3,
"successful_completions": 18,
"learning_moments": 15
},
"context_requirements": {
"minimum_context_window": 32000,
"optimal_context_window": 128000,
"requires_rag": false,
"benefits_from_long_context": true
}
},
"quality_score": {
"overall": 0.92,
"coherence": 0.95,
"completeness": 0.9,
"educational_value": 0.9,
"context_usage": 0.92,
"task_success": 0.88
},
"annotations": {
"summary": "comprehensive authentication system implementation",
"key_challenges": ["database migration issues", "JWT expiration handling"],
"solutions_discovered": ["custom middleware pattern", "refresh token rotation"],
"architectural_insights": ["separation of auth from business logic"],
"notable_techniques": ["multi-step planning", "incremental testing"]
}
}KEEP if ALL:
- ✅ Total tokens ≥ 100,000
- ✅ Total turns ≥ 50
- ✅ Maintains coherence (not random topic jumping)
- ✅ Has clear goal/project
- ✅ Multiple context switches
- ✅ Long-range dependencies exist
- ✅ Task has meaningful outcome
REJECT if ANY:
- ❌ Tokens < 100k
- ❌ Incoherent (spam, tests, corrupted)
- ❌ Single topic, no depth
- ❌ Abandoned very early
- ❌ Data corruption
QUALITY TIERS:
- Diamond (10%): >1M tokens, multi-session, complex project, successful
- Gold (30%): >500k tokens, coherent, multiple phases
- Silver (40%): >200k tokens, single project, decent completion
- Bronze (20%): >100k tokens, valid but simpler
def extract_long_context_features(conversation):
features = {}
# 1. CONTEXT WINDOW ANALYSIS
features['context_usage'] = analyze_context_window_usage(messages)
features['long_range_deps'] = find_long_range_dependencies(messages)
features['context_efficiency'] = measure_information_density()
# 2. CONVERSATION STRUCTURE
features['phases'] = segment_conversation_into_phases()
features['task_graph'] = build_task_dependency_graph()
features['topic_evolution'] = track_topic_changes()
# 3. KNOWLEDGE TRACKING
features['knowledge_building'] = track_cumulative_understanding()
features['discoveries'] = identify_key_insights()
features['pivots'] = detect_strategy_changes()
# 4. CONTEXT SWITCHES
features['switches'] = detect_context_switches()
features['switch_reasons'] = classify_switch_reasons()
# 5. FILE/CODE TRACKING
features['file_timeline'] = track_file_introduction_and_usage()
features['code_evolution'] = track_code_changes_over_time()
# 6. PROBLEM SOLVING PATTERNS
features['obstacle_resolution'] = identify_problems_and_solutions()
features['planning_depth'] = measure_lookahead_planning()
return featuresTotal: 1,500 conversations
STRATIFIED by:
- Token length: 100k-250k (30%), 250k-500k (30%), 500k-1M (25%), >1M (15%)
- Completion: completed (60%), partial (30%), abandoned (10%)
- Complexity: simple (20%), moderate (40%), complex (30%), expert (10%)
- Duration: single-session (40%), multi-session (60%)
SPLIT RATIOS:
├─ Train: 900 (60%) # Smaller due to rarity
├─ Validation: 300 (20%)
└─ Test: 300 (20%)
SPECIAL CONSIDERATIONS:
- Entire conversations only (never split)
- Same project → same split
- Temporal holdout: last 20% by time → test
- Ensure token distribution balanced across splits
- Hold out some ultra-long (>5M) for special test set
def deduplicate_long_conversations():
# Level 1: Exact duplicates (unlikely but check)
remove_identical_conversations()
# Level 2: Continuation conversations
# If conv_B starts where conv_A ends, merge or keep one
detect_and_merge_continuations()
# Level 3: Same project, same goal
if same_project and similar_goal:
keep_longest_or_most_complete()
# Level 4: Repetitive patterns
if many_similar_turns:
compress_or_flag_for_review()def validate_long_conversation(example):
checks = {
'meets_length': example['statistics']['total_tokens'] >= 100000,
'has_messages': len(example['messages']) >= 50,
'valid_json': all_messages_parseable(),
'coherent': measure_conversation_coherence() > 0.7,
'has_goal': example['conversation_structure']['main_goal'] is not None,
'no_corruption': no_truncated_messages(),
'no_pii': not contains_pii(example),
'has_context_switches': len(context_switches) >= 2,
'has_completion': has_meaningful_outcome(),
'token_distribution_reasonable': token_stddev_not_too_high()
}
quality_score = compute_quality_score(example)
return all(checks.values()) and quality_score > 0.7, checks┌─────────────────────────────────────────────────────────┐
│ 1. RAW DATA INGESTION │
│ ├─ Load 52GB backup │
│ ├─ Parse JSONL files │
│ └─ Basic validation │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 2. CONVERSATION SEGMENTATION │
│ ├─ Identify conversation boundaries │
│ ├─ Extract metadata │
│ └─ Link related conversations │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 3. PARALLEL EXTRACTION │
│ ├─ Extract agentic sequences → Dataset 1 │
│ ├─ Extract error-fix pairs → Dataset 2 │
│ └─ Identify long conversations → Dataset 3 │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 4. FEATURE EXTRACTION │
│ ├─ Language detection │
│ ├─ Complexity scoring │
│ ├─ Quality metrics │
│ └─ Educational value │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 5. QUALITY FILTERING │
│ ├─ Apply dataset-specific filters │
│ ├─ Remove PII │
│ ├─ Validate examples │
│ └─ Assign quality tiers │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 6. DEDUPLICATION │
│ ├─ Within-dataset dedup │
│ ├─ Cross-dataset overlap detection │
│ └─ Keep best examples │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 7. STRATIFIED SPLITTING │
│ ├─ Create train/val/test splits │
│ ├─ Maintain distributions │
│ └─ Prevent leakage │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 8. FORMAT CONVERSION │
│ ├─ HuggingFace datasets format │
│ ├─ OpenAI JSONL format │
│ ├─ Anthropic format │
│ └─ Research formats (Parquet, CSV) │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 9. DOCUMENTATION & METADATA │
│ ├─ Dataset cards │
│ ├─ Statistics reports │
│ ├─ Example samples │
│ └─ Usage guidelines │
└─────────────────────────────────────────────────────────┘
Track these across ALL datasets:
QUALITY_METRICS = {
# Completeness
'field_completeness': 0.95, # % of fields populated
'no_null_critical_fields': 1.0,
# Validity
'json_parseable': 1.0,
'schema_compliant': 1.0,
'language_detected': 0.98,
# Privacy
'pii_removed': 1.0,
'paths_sanitized': 1.0,
'secrets_removed': 1.0,
# Diversity
'language_diversity': 0.8, # coverage of languages
'complexity_distribution': 'balanced',
'outcome_diversity': 'includes failures',
# Usefulness
'educational_value': 0.7,
'reusability': 0.75,
'real_world_applicability': 0.9
}datasets/
├─ agentic_tool_use/
│ ├─ train.jsonl (35k examples)
│ ├─ val.jsonl (7.5k)
│ ├─ test.jsonl (7.5k)
│ ├─ dataset_card.md
│ ├─ statistics.json
│ └─ samples/ (100 examples for preview)
│
├─ code_debugging/
│ ├─ train.jsonl (56k)
│ ├─ val.jsonl (12k)
│ ├─ test.jsonl (12k)
│ ├─ by_language/
│ │ ├─ rust.jsonl
│ │ ├─ typescript.jsonl
│ │ └─ ...
│ ├─ by_difficulty/
│ │ ├─ beginner.jsonl
│ │ ├─ intermediate.jsonl
│ │ └─ expert.jsonl
│ └─ dataset_card.md
│
├─ long_context/
│ ├─ train.jsonl (900)
│ ├─ val.jsonl (300)
│ ├─ test.jsonl (300)
│ ├─ ultra_long/ (>1M tokens)
│ └─ dataset_card.md
│
├─ huggingface/
│ ├─ agentic_tool_use/ (HF format)
│ ├─ code_debugging/
│ └─ long_context/
│
├─ openai_finetune/
│ ├─ agentic_train.jsonl
│ ├─ debugging_train.jsonl
│ └─ ...
│
└─ statistics/
├─ global_stats.json
├─ quality_report.md
└─ dataset_comparison.md
Want me to build this complete pipeline? I'll create:
- Enhanced
preparecommand with all extractors - Quality validation framework
- Feature extraction modules
- Stratified splitting logic
- Deduplication engine
- Format converters (HF, OpenAI, Anthropic)
- Dataset statistics & visualization
- Documentation generator
This will give you production-grade datasets ready to:
- Train custom models
- Publish on HuggingFace
- License commercially
- Use for research
Ready to build?