RAC examples with empty true_labels are silently ignored, leading to incorrect classification #21

@alexwilson1

Summary

When using the ZeroShotClassificationPipeline from gliclass, any examples in rac_examples that contain empty true_labels are silently discarded. This causes serious calibration issues and inflated confidence in incorrect predictions, especially in NLI-style classification tasks.

The same miscalibration appears when only a single positive example is supplied. With no negative or neutral signal, the pipeline treats the lone positive example as sufficient evidence and pushes the score toward that label, exactly as it does when counter-examples with empty true_labels are passed and ignored.


🔬 Minimal Working Example

❌ With rac_examples — incorrect, high-confidence result

from transformers import AutoTokenizer
from gliclass import GLiClassModel, ZeroShotClassificationPipeline 
import torch

model_str = "knowledgator/gliclass-base-v2.0-rac-init"

model = GLiClassModel.from_pretrained(model_str)
tokenizer = AutoTokenizer.from_pretrained(model_str)

device = 'mps' if torch.backends.mps.is_available() else 'cuda:0' if torch.cuda.is_available() else 'cpu'

pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device=device)

example_1 = {
    "text": "I submitted my application last week but haven’t heard back yet.",
    "all_labels": ["this is about post-application"],
    "true_labels": ["this is about post-application"]
}

# ❌ This negative example is silently discarded
example_2 = {
    "text": "I was filling out the job application form when the site crashed.",
    "all_labels": ["this is about post-application"],
    "true_labels": []
}

premise = "The job portal crashed while I was still filling out the application."
hypotheses = ["this is about post-application"]

results = pipeline(premise, hypotheses, threshold=0.0, rac_examples=[example_1, example_2])[0]
print(results)

Output:

[{'label': 'this is about post-application', 'score': 0.9948280453681946}]

🔍 Even though the premise is about the pre-application stage, the model outputs a high-confidence score for post-application, because example_2 is dropped and contributes no counterbalancing signal.


🟢 Without rac_examples — correct behavior

pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device=device)

premise = "The job portal crashed while I was still filling out the application."
hypotheses = ["this is about post-application"]

results = pipeline(premise, hypotheses, threshold=0.0)[0]
print(results)

Output:

[{'label': 'this is about post-application', 'score': 0.10260037332773209}]

✅ Without the misleading calibration, the model gives a low score — as expected.


⚠️ SINGLE TRUE LABEL ADDED — Still bad behavior

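Passing just a single positive RAC example (no counter-examples) to the same pipeline, premise, and hypotheses produces the same high-confidence result. The score is identical to the two-example run above, which is consistent with example_2 having been dropped there: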

example_1 = {
    "text": "I submitted my application last week but haven’t heard back yet.",
    "all_labels": ["this is about post-application"],
    "true_labels": ["this is about post-application"]
}

results = pipeline(premise, hypotheses, threshold=0.0, rac_examples=[example_1])[0]
print(results)

Output:

[{'label': 'this is about post-application', 'score': 0.9948280453681946}]
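
The three runs above can be compared in one step. The following is a minimal sketch, not part of gliclass: `rac_score_shift` is a hypothetical helper that assumes only the call shape used in this issue, `pipeline(text, labels, threshold=0.0, rac_examples=...)[0]`, returning a list of `{'label', 'score'}` dicts.

```python
from typing import Any, Callable, Dict, List


def rac_score_shift(
    classify: Callable[..., List[List[Dict[str, Any]]]],
    text: str,
    labels: List[str],
    rac_examples: List[Dict[str, Any]],
) -> Dict[str, float]:
    """Return, per label, (score with rac_examples) - (score without).

    `classify` is assumed to behave like the pipeline calls in this issue.
    This helper is hypothetical and only illustrates the comparison.
    """
    # Baseline run: no RAC examples at all.
    baseline = classify(text, labels, threshold=0.0)[0]
    # Second run: identical input, plus the RAC examples under test.
    with_rac = classify(text, labels, threshold=0.0, rac_examples=rac_examples)[0]

    base_scores = {r["label"]: r["score"] for r in baseline}
    rac_scores = {r["label"]: r["score"] for r in with_rac}

    # Only report labels that both runs returned.
    return {
        label: rac_scores[label] - base_scores[label]
        for label in labels
        if label in base_scores and label in rac_scores
    }
```

Calling `rac_score_shift(pipeline, premise, hypotheses, [example_1])` and then again with `[example_1, example_2]` should give the same shift if example_2 is indeed discarded, which matches the identical scores reported above.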

✅ Expected Behavior

  • Examples with true_labels=[] should:
    • ❗ Act as negative signals, indicating “this text is not about the listed labels”; OR
    • ⚠️ Trigger a clear warning that the example will be ignored, so users can avoid false calibration.
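
The second option could look roughly like the sketch below. It is a user-side check rather than the library's implementation: `warn_on_empty_true_labels` is a hypothetical name, and the only assumption is the example format shown in this issue (dicts with `text`, `all_labels`, and `true_labels`).

```python
import warnings
from typing import Any, Dict, List


def warn_on_empty_true_labels(rac_examples: List[Dict[str, Any]]) -> List[int]:
    """Warn about RAC examples whose true_labels is missing or empty.

    Hypothetical helper (not part of gliclass). Returns the indices of the
    affected examples so the caller can inspect or remove them.
    """
    # An example counts as "empty" if true_labels is absent, None, or [].
    empty = [i for i, ex in enumerate(rac_examples) if not ex.get("true_labels")]
    if empty:
        warnings.warn(
            f"{len(empty)} of {len(rac_examples)} rac_examples have empty "
            f"true_labels (indices {empty}) and may be ignored by the pipeline.",
            UserWarning,
            stacklevel=2,
        )
    return empty
```

Running `warn_on_empty_true_labels([example_1, example_2])` before calling the pipeline would flag index 1, making the dropped negative example visible instead of silent.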

💡 Why This Matters

  • In zero-shot or few-shot setups, users expect every example to contribute to the output decision.
  • Silently discarding negative or neutral examples skews predictions, especially when examples are few.
  • This reduces trust and interpretability, and can yield confidently wrong classifications, a critical issue in real-world deployments.

🧪 Environment
