[Feature Request] ROUGE Score Drops Due to <think> Tags in LLM Outputs (Internal Reasoning) #1133

@BaconTomatoDeluxe

Description

Is your feature request related to a problem? Please describe.
Recent LLMs such as Qwen 3 and DeepSeek R1 wrap their intermediate reasoning steps in <think> tags within their responses. However, this reasoning content is also included in the generated output used for evaluation, which leads to unfairly low ROUGE scores.

Describe the solution you'd like
I'd like the evaluation step to exclude content wrapped in <think> ... </think> tags. This would let the evaluation focus only on the final answer intended for users, improving the fairness and accuracy of ROUGE-based scoring.

Describe alternatives you've considered
Right now, I'm using a simple workaround: filtering the generation output with a regex. Specifically, I patched the following line in

.venv/lib/python3.10/site-packages/autorag/nodes/generator/llama_index_llm.py, line 84:

generated_texts = list(map(lambda x: re.sub(r'<think>.*?</think>\s*', '', x.text, flags=re.DOTALL).strip(), results))

This strips out the <think> blocks before evaluation. While it works for now, I'm not sure it's the best long-term solution.
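For reference, the same filtering can be factored into a standalone helper and tested in isolation (a sketch; the name strip_think_blocks is hypothetical, not part of AutoRAG):

```python
import re

# Matches a <think>...</think> block plus any trailing whitespace.
# re.DOTALL lets '.' span newlines, since reasoning is usually multi-line;
# the non-greedy '.*?' stops at the first closing tag.
THINK_RE = re.compile(r'<think>.*?</think>\s*', flags=re.DOTALL)

def strip_think_blocks(text: str) -> str:
    """Remove internal-reasoning blocks so only the final answer is scored."""
    return THINK_RE.sub('', text).strip()
```

Applying this to each generated text right before metric computation would keep the ROUGE input limited to the user-facing answer, without patching installed package files.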

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), help wanted (Extra attention is needed)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions