An automated benchmarking system for evaluating coding capabilities of Large Language Models running on Ollama. Supports industry-standard benchmarks with comprehensive evaluation metrics and safe code execution.
- **Multiple Benchmarks**: HumanEval and MBPP support with 164 + 974 coding problems
- **Safe Execution**: Sandboxed code execution with resource limits and timeout protection
- **Comprehensive Metrics**: Pass@1, Pass@5, and Pass@10 evaluation with detailed analysis
- **Flexible Configuration**: YAML-based configuration system for all parameters
- **Results Management**: JSON and CSV export with comparison tools and visualizations
- **CLI Interface**: Easy-to-use command-line interface with extensive options
- **Detailed Logging**: Complete prompt/response logging for analysis and debugging
- **Cross-Platform**: Works on macOS, Linux, and Windows with automatic compatibility handling
- **Ollama Server**: Install and start Ollama

  ```bash
  # Install Ollama (see https://ollama.ai for platform-specific instructions)
  curl -fsSL https://ollama.ai/install.sh | sh

  # Start Ollama server
  ollama serve
  ```
- **Python Environment**: Python 3.8+ required

  ```bash
  # Clone the repository
  git clone <your-repo-url>
  cd bench-code

  # Create virtual environment (recommended)
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate

  # Install dependencies
  pip install -r requirements.txt
  ```
Edit config.yaml to match your environment:
```yaml
ollama:
  host: "http://localhost:11434"
  model: "your-model-name"  # e.g., "codellama:7b"
  timeout: 300
  parameters:
    temperature: 0.2
    top_p: 0.9
    num_ctx: 4096

logging:
  prompt_logging: true      # Enable detailed prompt/response logging
  prompt_log_dir: "logs"
```

```bash
# Test setup (recommended first step)
python benchmark_runner.py --dry-run

# Run a small test
python benchmark_runner.py --benchmark humaneval --max-samples 5

# Full benchmark suite
python benchmark_runner.py

# Pull model and run with custom settings
python benchmark_runner.py --pull-model --generations 3
```

```bash
# View latest prompt/response logs
python view_prompts.py

# Check results directory
ls results/

# View specific results
python view_prompts.py --task-id "HumanEval/0" --export-text analysis.txt
```

- Problems: 164 Python programming tasks from OpenAI
- Focus: Function completion from docstrings
- Evaluation: Functional correctness via unit tests
- Example: Complete `def has_close_elements(numbers, threshold):` from its docstring
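A passing completion for this task might look like the following (illustrative only; the benchmark itself verifies candidates against the problem's unit tests):

```python
def has_close_elements(numbers, threshold):
    """Return True if any two distinct numbers in the list are
    closer to each other than the given threshold."""
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False
```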
- Problems: 974 entry-level Python programming tasks
- Focus: Program synthesis from natural language descriptions
- Evaluation: Multiple test cases per problem
- Example: "Write a function that returns the sum of squares of even numbers"
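A correct solution to the example task above might look like this (the function name here is an illustrative assumption; each MBPP problem fixes the exact signature via its test cases):

```python
def sum_of_squares_of_evens(nums):
    """Return the sum of squares of the even numbers in nums."""
    return sum(n * n for n in nums if n % 2 == 0)
```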
```
python benchmark_runner.py [OPTIONS]

Options:
  -c, --config PATH       Configuration file (default: config.yaml)
  -b, --benchmark CHOICE  Benchmark: humaneval, mbpp, all (default: all)
  -n, --max-samples INT   Maximum problems to test
  -g, --generations INT   Generations per problem (default: 1)
  -m, --model TEXT        Override model from config
  --host TEXT             Override Ollama host
  -o, --output-dir PATH   Override output directory
  --log-level LEVEL       DEBUG, INFO, WARNING, ERROR
  --dry-run               Test setup without running
  --pull-model            Pull model before running
  --help                  Show help message
```

`results/humaneval_model_20250801_143052.json`
- Complete benchmark results with all details
- Individual problem results and generated code
- Execution logs and error information
- Pass@k metrics and timing data
`results/humaneval_model_20250801_143052_summary.csv`
`results/humaneval_model_20250801_143052_detailed.csv`
- Summary: High-level metrics and Pass@k scores
- Detailed: Per-problem results with execution details
`logs/humaneval_prompts_20250801_143052.jsonl`
- Complete interaction logs in JSONL format
- Each entry: prompt, response, extracted code, execution results, timing
```
HumanEval Results:
  Problems: 164
  Pass@1:  0.427 (42.7%)
  Pass@5:  0.573 (57.3%)
  Pass@10: 0.634 (63.4%)

MBPP Results:
  Problems: 974
  Pass@1:  0.521 (52.1%)

Total Time: 1,247s
```
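Pass@k is commonly computed with the unbiased estimator from the HumanEval paper: with `n` generated samples per problem, of which `c` pass, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, passes the tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that estimating Pass@10 reliably requires at least 10 generations per problem (`--generations 10`).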
```bash
# List available log files
python view_prompts.py --list

# View latest results
python view_prompts.py

# View specific task
python view_prompts.py --task-id "HumanEval/15"

# Show only failed attempts
python view_prompts.py --failed-only

# Export to readable text
python view_prompts.py --export-text detailed_analysis.txt

# Show statistics only
python view_prompts.py --stats
```

```bash
# Test code extraction logic
python test_fixes.py

# Debug specific test execution
python debug_test_execution.py
```

```yaml
ollama:
  host: "http://localhost:11434"
  model: "codellama:13b"
  timeout: 300
  parameters:
    temperature: 0.2   # Creativity vs. consistency
    top_p: 0.9         # Token probability threshold
    num_ctx: 4096      # Context window size
    num_predict: 512   # Max response tokens

benchmarks:
  humaneval:
    enabled: true
    max_samples: null  # null = all problems
    pass_k: [1, 5, 10] # Pass@k metrics to calculate
  mbpp:
    enabled: true
    max_samples: 100   # Limit for faster testing
    pass_k: [1, 5, 10]

execution:
  timeout: 10          # Code execution timeout (seconds)
  max_memory: "128MB"  # Memory limit
  sandbox: true        # Enable sandboxing
  simple_mode: false   # Disable resource limits (auto-enabled on macOS)

logging:
  level: "INFO"
  file: "benchmark.log"
  prompt_logging: true   # Log all prompts/responses
  prompt_log_dir: "logs" # Log directory
```

bench-code/
├── benchmark_runner.py      # Main CLI interface
├── ollama_client.py         # Ollama API client wrapper
├── data_loader.py           # Dataset loading and management
├── code_executor.py         # Safe code execution sandbox
├── humaneval_runner.py      # HumanEval benchmark implementation
├── mbpp_runner.py           # MBPP benchmark implementation
├── results_manager.py       # Results saving and analysis
├── prompt_logger.py         # Prompt/response logging system
├── view_prompts.py          # Utility to view logged prompts
├── test_fixes.py            # Testing and validation tools
├── debug_test_execution.py  # Debug utilities
├── config.yaml              # Configuration file
├── requirements.txt         # Python dependencies
├── data/                    # Downloaded datasets
├── results/                 # Benchmark results
└── logs/                    # Prompt/response logs
- Sandboxed Execution: Code runs in isolated environment with resource limits
- Timeout Protection: Prevents infinite loops and long-running processes
- Memory Limits: Prevents memory exhaustion attacks
- File System Limits: Restricts file operations and sizes
- Error Handling: Comprehensive error capture and logging
- Process Isolation: Each code execution in separate process
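The approach above can be sketched roughly as follows. This is a simplified illustration, not the project's actual `code_executor.py`; the `resource` limits are POSIX-only, which is why the `simple_mode` option exists for macOS:

```python
import os
import resource
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: int = 10, mem_mb: int = 512):
    """Run untrusted Python code in a separate process with CPU-time and
    address-space limits (POSIX only). Returns (passed, stdout, stderr)."""
    def set_limits():
        # Applied in the child process just before exec
        resource.setrlimit(resource.RLIMIT_CPU, (timeout, timeout))
        resource.setrlimit(resource.RLIMIT_AS, (mem_mb * 1024**2,) * 2)

    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True,
            timeout=timeout, preexec_fn=set_limits,
        )
        return proc.returncode == 0, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "", "timeout"
    finally:
        os.unlink(path)
```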
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama if not running
ollama serve

# Check available models
ollama list
```

```bash
# Pull the model manually
ollama pull codellama:7b

# Or use the pull flag
python benchmark_runner.py --pull-model
```

```yaml
# In config.yaml, enable simple mode
execution:
  simple_mode: true  # Disables problematic resource limits on macOS
```

```yaml
# Adjust limits in config.yaml
execution:
  timeout: 30          # Increase timeout
  max_memory: "256MB"  # Increase memory limit
```

```bash
# Test internet connection and run with debug logging
python benchmark_runner.py --log-level DEBUG --dry-run
```

```bash
# Enable verbose logging
python benchmark_runner.py --log-level DEBUG

# Test with minimal sample
python benchmark_runner.py --max-samples 1 --log-level DEBUG
```

```bash
# Test different models
python benchmark_runner.py --model "mistral:7b"
python benchmark_runner.py --model "deepseek-coder:6.7b"
```

```bash
#!/bin/bash
# Run multiple model comparisons
models=("codellama:7b" "codellama:13b" "deepseek-coder:6.7b")
for model in "${models[@]}"; do
  echo "Testing $model..."
  python benchmark_runner.py --model "$model" --max-samples 50
done
```

```python
from results_manager import ResultsManager

# Load and compare results
rm = ResultsManager("results")
latest = rm.get_latest_results("humaneval")
print(f"Pass@1: {latest['summary']['pass_at_1']:.3f}")

# Generate comparison report
files = rm.list_result_files()
rm.save_comparison_report(files, "model_comparison.csv")
```

```python
# Extend for custom benchmarks
class CustomBenchmarkRunner:
    def __init__(self, ollama_client, data_loader, executor):
        self.client = ollama_client
        # ... implement custom logic
```

```yaml
# Reduce samples for quick testing
benchmarks:
  humaneval:
    max_samples: 20
  mbpp:
    max_samples: 50

# Optimize generation parameters
ollama:
  parameters:
    num_predict: 256  # Shorter responses
    temperature: 0.1  # More deterministic
```

```yaml
# Full evaluation settings
benchmarks:
  humaneval:
    max_samples: null  # All problems
  mbpp:
    max_samples: null

# Multiple generations for Pass@k
execution:
  generations: 10  # For accurate Pass@10 metrics
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
python -m pytest tests/

# Format code
black . && isort .
```

This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for the HumanEval benchmark
- Google Research for the MBPP benchmark
- Ollama for the local LLM serving platform
- Hugging Face for dataset hosting and tools
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: This README and inline code comments
If you use this benchmarking system in your research, please cite:
```bibtex
@software{llm_coding_benchmarks,
  title={LLM Coding Benchmarks with Ollama},
  author={Your Name},
  year={2025},
  url={https://github.com/your-username/bench-code}
}
```

⭐ Star this repo if you find it useful!
🐛 Found a bug? Please report it in Issues
💡 Have suggestions? Start a Discussion