An automated benchmarking system for evaluating coding capabilities of Large Language Models running on Ollama. Supports industry-standard benchmarks with comprehensive evaluation metrics and safe code execution.
- **Multiple Benchmarks**: HumanEval and MBPP support with 164 + 974 coding problems
- **Safe Execution**: Sandboxed code execution with resource limits and timeout protection
- **Comprehensive Metrics**: Pass@1, Pass@5, and Pass@10 evaluation with detailed analysis
- **Flexible Configuration**: YAML-based configuration system for all parameters
- **Results Management**: JSON and CSV export with comparison tools and visualizations
- **CLI Interface**: Easy-to-use command-line interface with extensive options
- **Detailed Logging**: Complete prompt/response logging for analysis and debugging
- **Cross-Platform**: Works on macOS, Linux, and Windows with automatic compatibility handling
- **Ollama Server**: Install and start Ollama

  ```bash
  # Install Ollama (see https://ollama.ai for platform-specific instructions)
  curl -fsSL https://ollama.ai/install.sh | sh

  # Start Ollama server
  ollama serve
  ```
- **Python Environment**: Python 3.8+ required

  ```bash
  # Clone the repository
  git clone <your-repo-url>
  cd bench-code

  # Create virtual environment (recommended)
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate

  # Install dependencies
  pip install -r requirements.txt
  ```
Edit config.yaml to match your environment:
```yaml
ollama:
  host: "http://localhost:11434"
  model: "your-model-name"  # e.g., "codellama:7b"
  timeout: 300
  parameters:
    temperature: 0.2
    top_p: 0.9
    num_ctx: 4096

logging:
  prompt_logging: true      # Enable detailed prompt/response logging
  prompt_log_dir: "logs"
```

```bash
# Test setup (recommended first step)
python benchmark_runner.py --dry-run

# Run a small test
python benchmark_runner.py --benchmark humaneval --max-samples 5

# Full benchmark suite
python benchmark_runner.py

# Pull model and run with custom settings
python benchmark_runner.py --pull-model --generations 3
```

```bash
# View latest prompt/response logs
python view_prompts.py

# Check results directory
ls results/

# View specific results
python view_prompts.py --task-id "HumanEval/0" --export-text analysis.txt
```

- Problems: 164 Python programming tasks from OpenAI
- Focus: Function completion from docstrings
- Evaluation: Functional correctness via unit tests
- Example: Complete `def has_close_elements(numbers, threshold):` from its docstring
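A passing completion for this task might look like the following (illustrative only; the benchmark itself verifies candidates against the problem's unit tests):

```python
def has_close_elements(numbers, threshold):
    """Return True if any two distinct numbers in the list are
    closer to each other than the given threshold."""
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False
```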
- Problems: 974 entry-level Python programming tasks
- Focus: Program synthesis from natural language descriptions
- Evaluation: Multiple test cases per problem
- Example: "Write a function that returns the sum of squares of even numbers"
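A correct solution to the example task above might look like this (the function name here is an illustrative assumption; each MBPP problem fixes the exact signature via its test cases):

```python
def sum_of_squares_of_evens(nums):
    """Return the sum of squares of the even numbers in nums."""
    return sum(n * n for n in nums if n % 2 == 0)
```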
```
python benchmark_runner.py [OPTIONS]

Options:
  -c, --config PATH       Configuration file (default: config.yaml)
  -b, --benchmark CHOICE  Benchmark: humaneval, mbpp, all (default: all)
  -n, --max-samples INT   Maximum problems to test
  -g, --generations INT   Generations per problem (default: 1)
  -m, --model TEXT        Override model from config
  --host TEXT             Override Ollama host
  -o, --output-dir PATH   Override output directory
  --log-level LEVEL       DEBUG, INFO, WARNING, ERROR
  --dry-run               Test setup without running
  --pull-model            Pull model before running
  --help                  Show help message
```

`results/humaneval_model_20250801_143052.json`
- Complete benchmark results with all details
- Individual problem results and generated code
- Execution logs and error information
- Pass@k metrics and timing data
`results/humaneval_model_20250801_143052_summary.csv`
`results/humaneval_model_20250801_143052_detailed.csv`
- Summary: High-level metrics and Pass@k scores
- Detailed: Per-problem results with execution details
`logs/humaneval_prompts_20250801_143052.jsonl`
- Complete interaction logs in JSONL format
- Each entry: prompt, response, extracted code, execution results, timing
```
HumanEval Results:
  Problems: 164
  Pass@1:  0.427 (42.7%)
  Pass@5:  0.573 (57.3%)
  Pass@10: 0.634 (63.4%)

MBPP Results:
  Problems: 974
  Pass@1:  0.521 (52.1%)

Total Time: 1,247s
```
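Pass@k is commonly computed with the unbiased estimator from the HumanEval paper: with `n` generated samples per problem, of which `c` pass, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, passes the tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that estimating Pass@10 reliably requires at least 10 generations per problem (`--generations 10`).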
```bash
# List available log files
python view_prompts.py --list

# View latest results
python view_prompts.py

# View specific task
python view_prompts.py --task-id "HumanEval/15"

# Show only failed attempts
python view_prompts.py --failed-only

# Export to readable text
python view_prompts.py --export-text detailed_analysis.txt

# Show statistics only
python view_prompts.py --stats
```

```bash
# Test code extraction logic
python test_fixes.py

# Debug specific test execution
python debug_test_execution.py
```

```yaml
ollama:
  host: "http://localhost:11434"
  model: "codellama:13b"
  timeout: 300
  parameters:
    temperature: 0.2   # Creativity vs. consistency
    top_p: 0.9         # Token probability threshold
    num_ctx: 4096      # Context window size
    num_predict: 512   # Max response tokens

benchmarks:
  humaneval:
    enabled: true
    max_samples: null  # null = all problems
    pass_k: [1, 5, 10] # Pass@k metrics to calculate
  mbpp:
    enabled: true
    max_samples: 100   # Limit for faster testing
    pass_k: [1, 5, 10]

execution:
  timeout: 10          # Code execution timeout (seconds)
  max_memory: "128MB"  # Memory limit
  sandbox: true        # Enable sandboxing
  simple_mode: false   # Disable resource limits (auto-enabled on macOS)

logging:
  level: "INFO"
  file: "benchmark.log"
  prompt_logging: true   # Log all prompts/responses
  prompt_log_dir: "logs" # Log directory
```

bench-code/
├── benchmark_runner.py      # Main CLI interface
├── ollama_client.py         # Ollama API client wrapper
├── data_loader.py           # Dataset loading and management
├── code_executor.py         # Safe code execution sandbox
├── humaneval_runner.py      # HumanEval benchmark implementation
├── mbpp_runner.py           # MBPP benchmark implementation
├── results_manager.py       # Results saving and analysis
├── prompt_logger.py         # Prompt/response logging system
├── view_prompts.py          # Utility to view logged prompts
├── test_fixes.py            # Testing and validation tools
├── debug_test_execution.py  # Debug utilities
├── config.yaml              # Configuration file
├── requirements.txt         # Python dependencies
├── data/                    # Downloaded datasets
├── results/                 # Benchmark results
└── logs/                    # Prompt/response logs
- Sandboxed Execution: Code runs in isolated environment with resource limits
- Timeout Protection: Prevents infinite loops and long-running processes
- Memory Limits: Prevents memory exhaustion attacks
- File System Limits: Restricts file operations and sizes
- Error Handling: Comprehensive error capture and logging
- Process Isolation: Each code execution in separate process
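The approach above can be sketched roughly as follows. This is a simplified illustration, not the project's actual `code_executor.py`; the `resource` limits are POSIX-only, which is why the `simple_mode` option exists for macOS:

```python
import os
import resource
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout: int = 10, mem_mb: int = 512):
    """Run untrusted Python code in a separate process with CPU-time and
    address-space limits (POSIX only). Returns (passed, stdout, stderr)."""
    def set_limits():
        # Applied in the child process just before exec
        resource.setrlimit(resource.RLIMIT_CPU, (timeout, timeout))
        resource.setrlimit(resource.RLIMIT_AS, (mem_mb * 1024**2,) * 2)

    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True,
            timeout=timeout, preexec_fn=set_limits,
        )
        return proc.returncode == 0, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "", "timeout"
    finally:
        os.unlink(path)
```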
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama if not running
ollama serve

# Check available models
ollama list
```

```bash
# Pull the model manually
ollama pull codellama:7b

# Or use the pull flag
python benchmark_runner.py --pull-model
```

```yaml
# In config.yaml, enable simple mode
execution:
  simple_mode: true  # Disables problematic resource limits on macOS
```

```yaml
# Adjust limits in config.yaml
execution:
  timeout: 30          # Increase timeout
  max_memory: "256MB"  # Increase memory limit
```

```bash
# Test internet connection and run with debug logging
python benchmark_runner.py --log-level DEBUG --dry-run
```

```bash
# Enable verbose logging
python benchmark_runner.py --log-level DEBUG

# Test with minimal sample
python benchmark_runner.py --max-samples 1 --log-level DEBUG
```

```bash
# Test different models
python benchmark_runner.py --model "mistral:7b"
python benchmark_runner.py --model "deepseek-coder:6.7b"
```

```bash
#!/bin/bash
# Run multiple model comparisons
models=("codellama:7b" "codellama:13b" "deepseek-coder:6.7b")
for model in "${models[@]}"; do
  echo "Testing $model..."
  python benchmark_runner.py --model "$model" --max-samples 50
done
```

```python
from results_manager import ResultsManager

# Load and compare results
rm = ResultsManager("results")
latest = rm.get_latest_results("humaneval")
print(f"Pass@1: {latest['summary']['pass_at_1']:.3f}")

# Generate comparison report
files = rm.list_result_files()
rm.save_comparison_report(files, "model_comparison.csv")
```

```python
# Extend for custom benchmarks
class CustomBenchmarkRunner:
    def __init__(self, ollama_client, data_loader, executor):
        self.client = ollama_client
        # ... implement custom logic
```

```yaml
# Reduce samples for quick testing
benchmarks:
  humaneval:
    max_samples: 20
  mbpp:
    max_samples: 50

# Optimize generation parameters
ollama:
  parameters:
    num_predict: 256  # Shorter responses
    temperature: 0.1  # More deterministic
```

```yaml
# Full evaluation settings
benchmarks:
  humaneval:
    max_samples: null  # All problems
  mbpp:
    max_samples: null

# Multiple generations for Pass@k
execution:
  generations: 10  # For accurate Pass@10 metrics
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
python -m pytest tests/

# Format code
black . && isort .
```

This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI for the HumanEval benchmark
- Google Research for the MBPP benchmark
- Ollama for the local LLM serving platform
- Hugging Face for dataset hosting and tools
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: This README and inline code comments
If you use this benchmarking system in your research, please cite:
```bibtex
@software{llm_coding_benchmarks,
  title={LLM Coding Benchmarks with Ollama},
  author={Your Name},
  year={2025},
  url={https://github.com/your-username/bench-code}
}
```

⭐ Star this repo if you find it useful!
🐛 Found a bug? Please report it in Issues
💡 Have suggestions? Start a Discussion