[Feature Request] ROUGE Score Drops Due to <think> Tags in LLM Outputs (Internal Reasoning) #1133

@BaconTomatoDeluxe

Description

Is your feature request related to a problem? Please describe.
Recent LLMs such as Qwen 3 and DeepSeek R1 wrap their intermediate reasoning steps in <think> tags within their responses. However, this reasoning content is also included in the generated output used for evaluation, which leads to unfairly low ROUGE scores.

Describe the solution you'd like
I'd like the evaluation step to exclude content wrapped in <think> ... </think> tags. This would let the evaluation focus only on the final answer intended for users, improving the fairness and accuracy of ROUGE-based scoring.

Describe alternatives you've considered
Right now, I'm using a simple workaround: filtering the generation output with a regex. Specifically, I patched the following line in

.venv/lib/python3.10/site-packages/autorag/nodes/generator/llama_index_llm.py, line 84:

generated_texts = list(map(lambda x: re.sub(r'<think>.*?</think>\s*', '', x.text, flags=re.DOTALL).strip(), results))

This strips out the <think> blocks before evaluation. While it works for now, I'm not sure it's the best long-term solution.
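For reference, the same filtering can be factored into a standalone helper and tested in isolation (a sketch; the name strip_think_blocks is hypothetical, not part of AutoRAG):

```python
import re

# Matches a <think>...</think> block plus any trailing whitespace.
# re.DOTALL lets '.' span newlines, since reasoning is usually multi-line;
# the non-greedy '.*?' stops at the first closing tag.
THINK_RE = re.compile(r'<think>.*?</think>\s*', flags=re.DOTALL)

def strip_think_blocks(text: str) -> str:
    """Remove internal-reasoning blocks so only the final answer is scored."""
    return THINK_RE.sub('', text).strip()
```

Applying this to each generated text right before metric computation would keep the ROUGE input limited to the user-facing answer, without patching installed package files.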

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), help wanted (Extra attention is needed)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions