scouzi1966/qwen-humaneval

πŸ§ͺ LLM Coding Benchmarks with Ollama

Python 3.8+ | License: MIT | Ollama

An automated benchmarking system for evaluating the coding capabilities of Large Language Models running on Ollama. It supports industry-standard benchmarks with comprehensive evaluation metrics and safe code execution.

πŸš€ Features

  • πŸ“Š Multiple Benchmarks: HumanEval and MBPP support, with 164 and 974 coding problems respectively
  • πŸ”’ Safe Execution: Sandboxed code execution with resource limits and timeout protection
  • πŸ“ˆ Comprehensive Metrics: Pass@1, Pass@5, Pass@10 evaluation with detailed analysis
  • βš™οΈ Flexible Configuration: YAML-based configuration system for all parameters
  • πŸ’Ύ Results Management: JSON and CSV export with comparison tools and visualizations
  • πŸ–₯️ CLI Interface: Easy-to-use command-line interface with extensive options
  • πŸ“ Detailed Logging: Complete prompt/response logging for analysis and debugging
  • πŸ”§ Cross-Platform: Works on macOS, Linux, and Windows with automatic compatibility handling

πŸ“¦ Installation

Prerequisites

  1. Ollama Server: Install and start Ollama

    # Install Ollama (see https://ollama.ai for platform-specific instructions)
    curl -fsSL https://ollama.ai/install.sh | sh
    
    # Start Ollama server
    ollama serve
  2. Python Environment: Python 3.8+ required

    # Clone the repository
    git clone <your-repo-url>
    cd bench-code
    
    # Create virtual environment (recommended)
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
    # Install dependencies
    pip install -r requirements.txt

⚑ Quick Start

1. Configure Your Setup

Edit config.yaml to match your environment:

ollama:
  host: "http://localhost:11434"
  model: "your-model-name"  # e.g., "codellama:7b"
  timeout: 300
  parameters:
    temperature: 0.2
    top_p: 0.9
    num_ctx: 4096

logging:
  prompt_logging: true  # Enable detailed prompt/response logging
  prompt_log_dir: "logs"

2. Run Your First Benchmark

# Test setup (recommended first step)
python benchmark_runner.py --dry-run

# Run a small test
python benchmark_runner.py --benchmark humaneval --max-samples 5

# Full benchmark suite
python benchmark_runner.py

# Pull model and run with custom settings
python benchmark_runner.py --pull-model --generations 3

3. View Results

# View latest prompt/response logs
python view_prompts.py

# Check results directory
ls results/

# View specific results
python view_prompts.py --task-id "HumanEval/0" --export-text analysis.txt

πŸ“‹ Benchmark Details

HumanEval

  • Problems: 164 Python programming tasks from OpenAI
  • Focus: Function completion from docstrings
  • Evaluation: Functional correctness via unit tests
  • Example: Complete def has_close_elements(numbers, threshold):
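As an illustration of what the model is asked to produce, one passing completion for the `has_close_elements` task (HumanEval/0) might look like:

```python
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer than threshold."""
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False
```

The benchmark then runs the task's hidden unit tests against the completed function to judge functional correctness.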

MBPP (Mostly Basic Programming Problems)

  • Problems: 974 entry-level Python programming tasks
  • Focus: Program synthesis from natural language descriptions
  • Evaluation: Multiple test cases per problem
  • Example: "Write a function that returns the sum of squares of even numbers"

🎯 Command Line Options

python benchmark_runner.py [OPTIONS]

Options:
  -c, --config PATH         Configuration file (default: config.yaml)
  -b, --benchmark CHOICE    Benchmark: humaneval, mbpp, all (default: all)
  -n, --max-samples INT     Maximum problems to test
  -g, --generations INT     Generations per problem (default: 1)
  -m, --model TEXT          Override model from config
  --host TEXT              Override Ollama host
  -o, --output-dir PATH     Override output directory
  --log-level LEVEL        DEBUG, INFO, WARNING, ERROR
  --dry-run                Test setup without running
  --pull-model             Pull model before running
  --help                   Show help message

πŸ“Š Results and Analysis

Output Formats

JSON Files (Complete Results)

results/humaneval_model_20250801_143052.json
  • Complete benchmark results with all details
  • Individual problem results and generated code
  • Execution logs and error information
  • Pass@k metrics and timing data

CSV Files (Summary Data)

results/humaneval_model_20250801_143052_summary.csv
results/humaneval_model_20250801_143052_detailed.csv
  • Summary: High-level metrics and Pass@k scores
  • Detailed: Per-problem results with execution details

Prompt/Response Logs

logs/humaneval_prompts_20250801_143052.jsonl
  • Complete interaction logs in JSONL format
  • Each entry: prompt, response, extracted code, execution results, timing

Example Output

HumanEval Results:
  Problems: 164
  Pass@1: 0.427 (42.7%)
  Pass@5: 0.573 (57.3%)
  Pass@10: 0.634 (63.4%)
  
MBPP Results:
  Problems: 974
  Pass@1: 0.521 (52.1%)
  Total Time: 1,247s
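Pass@k is conventionally computed with the unbiased estimator from the HumanEval paper: given n generations per problem of which c pass, the estimate is 1 - C(n-c, k)/C(n, k). A minimal sketch (the repository's own results code may compute it differently):

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples, c of them correct.

    Estimates the probability that at least one of k random samples passes.
    """
    if n - c < k:
        return 1.0
    # Product form of 1 - C(n-c, k) / C(n, k), numerically stable for large n.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```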

πŸ” Viewing and Analyzing Results

Prompt/Response Viewer

# List available log files
python view_prompts.py --list

# View latest results
python view_prompts.py

# View specific task
python view_prompts.py --task-id "HumanEval/15"

# Show only failed attempts
python view_prompts.py --failed-only

# Export to readable text
python view_prompts.py --export-text detailed_analysis.txt

# Show statistics only
python view_prompts.py --stats

Debug and Validation Tools

# Test code extraction logic
python test_fixes.py

# Debug specific test execution
python debug_test_execution.py

βš™οΈ Configuration

Ollama Settings

ollama:
  host: "http://localhost:11434"
  model: "codellama:13b"
  timeout: 300
  parameters:
    temperature: 0.2        # Creativity vs consistency
    top_p: 0.9             # Token probability threshold
    num_ctx: 4096          # Context window size
    num_predict: 512       # Max response tokens

Benchmark Configuration

benchmarks:
  humaneval:
    enabled: true
    max_samples: null      # null = all problems
    pass_k: [1, 5, 10]     # Pass@k metrics to calculate
    
  mbpp:
    enabled: true
    max_samples: 100       # Limit for faster testing
    pass_k: [1, 5, 10]

Execution Settings

execution:
  timeout: 10              # Code execution timeout (seconds)
  max_memory: "128MB"      # Memory limit
  sandbox: true            # Enable sandboxing
  simple_mode: false       # Disable resource limits (enabled automatically on macOS)

Logging Configuration

logging:
  level: "INFO"
  file: "benchmark.log"
  prompt_logging: true     # Log all prompts/responses
  prompt_log_dir: "logs"   # Log directory

πŸ—‚οΈ Project Structure

bench-code/
β”œβ”€β”€ πŸ“„ benchmark_runner.py     # Main CLI interface
β”œβ”€β”€ πŸ”Œ ollama_client.py        # Ollama API client wrapper
β”œβ”€β”€ πŸ“š data_loader.py          # Dataset loading and management
β”œβ”€β”€ ⚑ code_executor.py        # Safe code execution sandbox
β”œβ”€β”€ πŸ§ͺ humaneval_runner.py     # HumanEval benchmark implementation
β”œβ”€β”€ πŸ§ͺ mbpp_runner.py          # MBPP benchmark implementation
β”œβ”€β”€ πŸ’Ύ results_manager.py      # Results saving and analysis
β”œβ”€β”€ πŸ“ prompt_logger.py        # Prompt/response logging system
β”œβ”€β”€ πŸ‘οΈ view_prompts.py         # Utility to view logged prompts
β”œβ”€β”€ πŸ”§ test_fixes.py           # Testing and validation tools
β”œβ”€β”€ πŸ› debug_test_execution.py # Debug utilities
β”œβ”€β”€ βš™οΈ config.yaml             # Configuration file
β”œβ”€β”€ πŸ“‹ requirements.txt        # Python dependencies
β”œβ”€β”€ πŸ“Š data/                   # Downloaded datasets
β”œβ”€β”€ πŸ“ˆ results/                # Benchmark results
└── πŸ“ logs/                   # Prompt/response logs

πŸ›‘οΈ Safety Features

  • Sandboxed Execution: Code runs in isolated environment with resource limits
  • Timeout Protection: Prevents infinite loops and long-running processes
  • Memory Limits: Prevents memory exhaustion attacks
  • File System Limits: Restricts file operations and sizes
  • Error Handling: Comprehensive error capture and logging
  • Process Isolation: Each code execution in separate process
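The core of the isolation pattern above is running each snippet in its own subprocess with a wall-clock timeout. A minimal sketch of that idea; the project's `code_executor.py` presumably layers memory and filesystem limits (e.g. via the `resource` module) on top of this:

```python
import subprocess
import sys

def run_snippet(code, timeout=10):
    """Run untrusted Python code in a separate process with a timeout.

    Returns (ok, stdout, stderr). This is a simplified sketch, not the
    repository's actual executor: it provides process isolation and a
    timeout, but no memory or filesystem limits.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode == 0, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "", "timeout"
```

`subprocess.run` kills the child process when the timeout expires, which is what stops infinite loops from hanging the benchmark.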

πŸ”§ Troubleshooting

Common Issues

1. Ollama Connection Failed

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama if not running
ollama serve

# Check available models
ollama list

2. Model Not Found

# Pull the model manually
ollama pull codellama:7b

# Or use the pull flag
python benchmark_runner.py --pull-model

3. Resource Limit Errors (macOS)

# In config.yaml, enable simple mode
execution:
  simple_mode: true  # Disables problematic resource limits on macOS

4. Memory/Timeout Issues

# Adjust limits in config.yaml
execution:
  timeout: 30        # Increase timeout
  max_memory: "256MB" # Increase memory limit

5. Dataset Download Issues

# Test internet connection and run with debug logging
python benchmark_runner.py --log-level DEBUG --dry-run

Debug Mode

# Enable verbose logging
python benchmark_runner.py --log-level DEBUG

# Test with minimal sample
python benchmark_runner.py --max-samples 1 --log-level DEBUG

πŸš€ Advanced Usage

Custom Models

# Test different models
python benchmark_runner.py --model "mistral:7b"
python benchmark_runner.py --model "deepseek-coder:6.7b"

Batch Processing

#!/bin/bash
# Run multiple model comparisons
models=("codellama:7b" "codellama:13b" "deepseek-coder:6.7b")

for model in "${models[@]}"; do
    echo "Testing $model..."
    python benchmark_runner.py --model "$model" --max-samples 50
done

Result Analysis

from results_manager import ResultsManager

# Load and compare results
rm = ResultsManager("results")
latest = rm.get_latest_results("humaneval")
print(f"Pass@1: {latest['summary']['pass_at_1']:.3f}")

# Generate comparison report
files = rm.list_result_files()
rm.save_comparison_report(files, "model_comparison.csv")

Custom Benchmarks

# Extend for custom benchmarks
class CustomBenchmarkRunner:
    def __init__(self, ollama_client, data_loader, executor):
        self.client = ollama_client
        # ... implement custom logic

πŸ“ˆ Performance Optimization

For Faster Benchmarking

# Reduce samples for quick testing
benchmarks:
  humaneval:
    max_samples: 20
  mbpp:
    max_samples: 50

# Optimize generation parameters
ollama:
  parameters:
    num_predict: 256    # Shorter responses
    temperature: 0.1    # More deterministic

For Comprehensive Evaluation

# Full evaluation settings
benchmarks:
  humaneval:
    max_samples: null   # All problems
  mbpp:
    max_samples: null

# Multiple generations for Pass@k
execution:
  generations: 10       # For accurate Pass@10 metrics

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
python -m pytest tests/

# Format code
black . && isort .

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

πŸ“ž Support

🏷️ Citation

If you use this benchmarking system in your research, please cite:

@software{llm_coding_benchmarks,
  title={LLM Coding Benchmarks with Ollama},
  author={Your Name},
  year={2025},
  url={https://github.com/your-username/bench-code}
}

⭐ Star this repo if you find it useful!

πŸ› Found a bug? Please report it in Issues

πŸ’‘ Have suggestions? Start a Discussion
