Multi-LLM Evaluation & Deployment Readiness Benchmark

Author: Theocharis Triantafyllidis

Description

This notebook provides a comprehensive framework to compare, evaluate, and rank multiple instruction-tuned LLMs for real-world deployment scenarios. It is designed to simulate enterprise-grade model selection workflows, enabling users to assess trade-offs between model performance, efficiency, and deployment readiness.

Key Features:

  • Compare multiple LLMs (Mistral, Qwen, Hermes) across diverse prompts
  • Measure latency, verbosity, and semantic similarity of generated outputs
  • Automatically rank models to identify the most suitable candidate for deployment
  • Provide an interactive Gradio playground for real-time testing
  • Visualize evaluation results and generate CSV reports for further analysis

Use Case:
Ideal for teams and researchers who need to benchmark models under production-like conditions and make informed decisions about enterprise LLM deployment. A minimal sketch of the core evaluation loop follows below.
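
The notebook's exact implementation is not reproduced here; the following is a minimal, illustrative sketch of such an evaluation loop, assuming Hugging Face transformers text-generation pipelines for the three models, sentence-transformers embeddings for semantic similarity, a simple word count for verbosity, and wall-clock timing for latency. The model IDs, prompts, and reference answers are placeholders, not the notebook's actual data.

```python
# Illustrative sketch only: model IDs, prompts, and references are assumptions.
import time
import pandas as pd
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

MODELS = {  # assumed Hugging Face model IDs for the three candidates
    "Mistral": "mistralai/Mistral-7B-Instruct-v0.2",
    "Qwen": "Qwen/Qwen2-7B-Instruct",
    "Hermes": "NousResearch/Hermes-2-Pro-Mistral-7B",
}
PROMPTS = {  # hypothetical prompt categories
    "reasoning": "Explain in two sentences why the sky is blue.",
    "summarization": "Summarize: Large language models learn patterns from large text corpora.",
}
REFERENCES = {  # hypothetical reference answers used for semantic similarity
    "reasoning": "Sunlight is scattered by air molecules; shorter blue wavelengths scatter the most.",
    "summarization": "LLMs learn language patterns from large amounts of text.",
}

embedder = SentenceTransformer("all-MiniLM-L6-v2")
rows = []
for model_name, model_id in MODELS.items():
    generator = pipeline("text-generation", model=model_id, device_map="auto")
    for category, prompt in PROMPTS.items():
        start = time.perf_counter()
        output = generator(prompt, max_new_tokens=128, return_full_text=False)[0]["generated_text"]
        latency = time.perf_counter() - start      # inference latency in seconds
        verbosity = len(output.split())            # word-count proxy for verbosity
        similarity = util.cos_sim(                 # semantic similarity to the reference answer
            embedder.encode(output), embedder.encode(REFERENCES[category])
        ).item()
        rows.append({"model": model_name, "category": category,
                     "latency_s": latency, "verbosity": verbosity, "similarity": similarity})

df = pd.DataFrame(rows)
# Rank models: higher similarity is better; lower latency breaks ties.
ranking = (df.groupby("model")
             .agg(similarity=("similarity", "mean"),
                  latency_s=("latency_s", "mean"),
                  verbosity=("verbosity", "mean"))
             .sort_values(["similarity", "latency_s"], ascending=[False, True]))
print(ranking)
df.to_csv("evaluation_results.csv", index=False)   # CSV report for further analysis
```

Ranking by mean similarity first and mean latency second is only one reasonable aggregation; the notebook may weight or combine the metrics differently.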


Getting Started with Colab

  1. Open the notebook in Google Colab:
     Open Notebook in Colab
  2. Save a copy to your own Google Drive: click File → Save a copy in Drive.
  3. Run the cells interactively to evaluate models, visualize results, and try the Gradio playground (a sketch of such a playground follows below).
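
For orientation, here is a hypothetical sketch of what a side-by-side Gradio playground of this kind can look like; the notebook's actual interface, helper names, and model IDs may differ. It reuses the same assumed model IDs as the evaluation sketch above.

```python
# Hypothetical playground sketch; not the notebook's exact interface.
import gradio as gr
from transformers import pipeline

MODEL_IDS = {  # assumed model IDs, same as in the evaluation sketch
    "Mistral": "mistralai/Mistral-7B-Instruct-v0.2",
    "Qwen": "Qwen/Qwen2-7B-Instruct",
    "Hermes": "NousResearch/Hermes-2-Pro-Mistral-7B",
}
generators = {name: pipeline("text-generation", model=mid, device_map="auto")
              for name, mid in MODEL_IDS.items()}

def compare(prompt):
    # One generation per model so the outputs can be read side by side.
    return tuple(
        generators[name](prompt, max_new_tokens=128, return_full_text=False)[0]["generated_text"]
        for name in MODEL_IDS
    )

demo = gr.Interface(
    fn=compare,
    inputs=gr.Textbox(label="Prompt"),
    outputs=[gr.Textbox(label=name) for name in MODEL_IDS],
    title="Multi-LLM Playground",
)
demo.launch()  # renders inline in Colab; pass share=True for a temporary public link
```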


About

A repository for benchmarking multiple instruction-tuned LLMs across various prompt categories. Evaluates model generation quality, inference latency, and output verbosity, and provides an interactive Gradio playground for real-time comparison. Designed for research, deployment testing, and demonstration of multi-LLM evaluation workflows.
