**Is your feature request related to a problem? Please describe.**
Recent LLMs such as Qwen 3 and DeepSeek R1 use `<think>` tags to include intermediate reasoning steps in their responses. However, this reasoning content is also included in the generated output used for evaluation, which leads to unfairly low ROUGE scores.
**Describe the solution you'd like**
I'd like the evaluation step to exclude content wrapped in `<think> ... </think>` tags. This would let evaluation focus only on the final answer intended for users, improving the fairness and accuracy of ROUGE-based scoring.
**Describe alternatives you've considered**
Right now, I'm using a simple workaround that filters the generation output with a regex. Specifically, I patched line 84 of `.venv/lib/python3.10/site-packages/autorag/nodes/generator/llama_index_llm.py`:

```python
generated_texts = list(map(lambda x: re.sub(r'<think>.*?</think>\s*', '', x.text, flags=re.DOTALL).strip(), results))
```
This strips out the `<think>` blocks before evaluation. While it works for now, I'm not sure it's the best long-term solution.
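A less invasive alternative to patching site-packages would be a small post-processing helper applied to generated texts before scoring. Here is a minimal sketch using the same regex; the function name and usage are hypothetical, not an existing AutoRAG API:

```python
import re

# Matches a <think>...</think> reasoning block plus any trailing whitespace.
# DOTALL lets '.' span newlines, since reasoning is usually multi-line;
# the non-greedy '.*?' stops at the first closing tag.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Remove reasoning blocks so only the final answer is evaluated."""
    return THINK_BLOCK.sub("", text).strip()

# Example: only the final answer survives filtering.
raw = "<think>The user asks for the capital of France.\nIt is Paris.</think>Paris"
print(strip_reasoning(raw))  # -> Paris
```

Applying a helper like this at the evaluation boundary (rather than inside the generator node) would also keep the raw generations intact in case the reasoning traces are useful for debugging.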