docs: remove pre-IVR validation and update readme with v2 benchmark results (#769)

ajbozarth · web-flow · commit 951145d6fc6e · 2026-04-01T15:26:45.000Z
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
diff --git a/docs/examples/instruct_validate_repair/qiskit_code_validation/README.md b/docs/examples/instruct_validate_repair/qiskit_code_validation/README.md
@@ -5,10 +5,9 @@ This example demonstrates using Mellea's Instruct-Validate-Repair (IVR) pattern
 ## What This Example Does
 
 Takes a prompt containing deprecated Qiskit code and:
-1. Detects QKT violations in the input code
-2. Passes those violations to the LLM as context
-3. Generates corrected code that passes QKT validation
-4. Automatically repairs the code if validation fails (up to 10 attempts)
+1. Generates corrected code using the LLM
+2. Validates the output against QKT rules
+3. Automatically repairs the code if validation fails (up to 10 attempts)
 
 ## Quick Start
 
@@ -29,10 +28,9 @@ Dependencies (`mellea`, `flake8-qiskit-migration`) are automatically installed.
 
 ### The IVR Pipeline
 
-1. **Pre-condition validation**: Validates the input prompt and any code it contains
-2. **Instruction**: LLM generates code following structured requirements
-3. **Post-condition validation**: Validates generated code against QKT rules (see [Qiskit Migration Guide](https://docs.quantum.ibm.com/api/migration-guides))
-4. **Repair loop**: Automatically repairs code that fails validation (up to 10 attempts)
+1. **Instruction**: LLM generates code following structured requirements
+2. **Post-condition validation**: Validates generated code against QKT rules (see [Qiskit Migration Guide](https://docs.quantum.ibm.com/api/migration-guides))
+3. **Repair loop**: Automatically repairs code that fails validation (up to 10 attempts)
 
 ### Sampling Strategies
 
@@ -47,20 +45,20 @@ To switch strategies, edit the `use_multiturn_strategy` variable in `test_qiskit
 
 #### Strategy Performance Comparison
 
-Benchmarks on `mistral-small-3.2-24b-qiskit` model, no system prompt:
+Benchmarks on `mistral-small-3.2-24b-qiskit` model:
 
 | Dataset | Strategy | First Pass (QKT) | Post-Repair (QKT) |
 |---------|----------|------------|-------------|
-| **QHE** | RepairTemplate | 98.0% | **100%** |
-|         | MultiTurn | **100%** | **100%** |
-| **QKT** | RepairTemplate | 98.0% | **100%** |
-|         | MultiTurn | 93.3% | **100%** |
+| **QHE** | RepairTemplate | 97.4% | **100%** |
+|         | MultiTurn | 95.4% | **100%** |
+| **QKT** | RepairTemplate | 88.9% | **100%** |
+|         | MultiTurn | **97.8%** | **100%** |
 
 **Datasets:**
 - **QHE** (QiskitHumanEval): 151 general Qiskit code generation tasks
 - **QKT**: 45 Qiskit version migration tasks requiring fixes to deprecated APIs
 
-**Note:** Pass rates measure whether generated code passes QKT validation rules, not whether the code correctly solves the prompt. On QHE, the model achieves ~32.5% correctness when running the QHE check() test suite against the generated code. Full benchmark data and analysis are available in @ajbozarth's [toolbox repo](https://github.com/ajbozarth/toolbox/tree/main/mellea/qiskit_code_validation/benchmarking).
+**Note:** Pass rates measure whether generated code passes QKT validation rules, not whether the code correctly solves the prompt. On QHE, the model achieves ~27.8% correctness when running the QHE check() test suite against the generated code. Full benchmark data and analysis are available in @ajbozarth's [toolbox repo](https://github.com/ajbozarth/toolbox/tree/main/mellea/qiskit_code_validation/benchmarking).
 
 ### Code Structure
 
diff --git a/docs/examples/instruct_validate_repair/qiskit_code_validation/qiskit_code_validation.py b/docs/examples/instruct_validate_repair/qiskit_code_validation/qiskit_code_validation.py
@@ -93,18 +93,6 @@ def generate_validated_qiskit_code(
     Returns:
         Tuple of (generated_code, success, attempts_used)
     """
-    # Pre-validate input code if present — include violations as context rather than failing
-    is_valid, error_msg = validate_input_code(prompt)
-    if not is_valid:
-        print(
-            f"Input code has QKT violations, including as context for LLM: {error_msg}"
-        )
-        prompt = (
-            f"{prompt}\n\n"
-            f"Note: the code above has the following Qiskit migration issues that must be fixed:\n"
-            f"{error_msg}"
-        )
-
     # Only pass optional kwargs if they have values — avoids passing None to m.instruct()
     extra: dict = {}
     if grounding_context:
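
Reviewer note: with the pre-validation block removed, the flow the README describes reduces to generate, validate, repair. The sketch below illustrates that loop in isolation. It is not Mellea's API: `instruct_validate_repair`, `generate`, `validate`, `toy_validate`, and `make_fake_generator` are hypothetical stand-ins, the repair prompt format is an assumption, and the toy validator checks a single string rather than running real QKT rules. Only the return shape `(generated_code, success, attempts_used)` and the limit of 10 attempts are taken from the diff above.

```python
from typing import Callable, List, Tuple


def instruct_validate_repair(
    prompt: str,
    generate: Callable[[str], str],
    validate: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 10,
) -> Tuple[str, bool, int]:
    """Return (generated_code, success, attempts_used).

    Hypothetical helper: `generate` stands in for the LLM call and
    `validate` for the QKT post-condition check.
    """
    current_prompt = prompt
    code = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(current_prompt)
        is_valid, error_msg = validate(code)
        if is_valid:
            return code, True, attempt
        # Repair step (assumed format): feed the failed output and the
        # validator's message back into the next generation request.
        current_prompt = (
            f"{prompt}\n\n"
            f"Previous attempt:\n{code}\n\n"
            f"Validation errors to fix:\n{error_msg}"
        )
    return code, False, max_attempts


def toy_validate(code: str) -> Tuple[bool, str]:
    """Toy stand-in for QKT validation: flags one removed import."""
    if "from qiskit import execute" in code:
        return False, "qiskit.execute was removed in Qiskit 1.0"
    return True, ""


def make_fake_generator(outputs: List[str]) -> Callable[[str], str]:
    """Return a fake LLM that yields the canned outputs in order."""
    remaining = iter(outputs)
    return lambda _prompt: next(remaining)


bad = "from qiskit import execute\n"
good = "from qiskit import transpile\n"

code, success, attempts = instruct_validate_repair(
    "Update this code to the current Qiskit API",
    make_fake_generator([bad, good]),
    toy_validate,
)
print(success, attempts)
```

Here the first canned output fails validation and the second passes, so the loop stops on the second attempt. Input code is never inspected before generation in this sketch, which matches the behavior after this change: violations surface only through validation of the generated output.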