Hi there,
Thank you for your impressive work on this project.
While exploring the dataset structure, I noticed some inconsistencies in the training data files for the GSM8K benchmark across different models. While most models seem to align with the standard 7,472 samples (usually consisting of 2 files), I found the following discrepancies in specific directories:
- gsm8k/Llama-2-70b-chat-hf: This directory contains 4 files instead of the expected 2. It includes outputs and predictions for both "2085" and "7472":
  - run_2085_outputs.pkl / run_2085_predictions.npy
  - run_7472_outputs.pkl / run_7472_predictions.npy
- gsm8k/Llama-2-7b-chat-hf/train: The files here correspond to a count of 4,000 rather than the full set:
  - run_4000_outputs.pkl
  - run_4000_predictions.npy
- gsm8k/gemma-7b/train: The files here correspond to a count of 6,980:
  - run_6980_outputs.pkl
  - run_6980_predictions.npy
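For reference, I inferred these counts from the filenames; a small script like the one below could confirm them by loading each file pair directly. This is just a sketch, assuming the .pkl files hold a pickled sequence and the .npy files a NumPy array (based purely on the extensions):

```python
import pickle
import numpy as np

def check_run_files(outputs_path, predictions_path):
    """Load one run's outputs/predictions pair and report their lengths."""
    with open(outputs_path, "rb") as f:
        outputs = pickle.load(f)
    predictions = np.load(predictions_path, allow_pickle=True)
    print(f"{outputs_path}: {len(outputs)} outputs")
    print(f"{predictions_path}: {len(predictions)} predictions")
    return len(outputs), len(predictions)

# Example (paths taken from the directories listed above):
# check_run_files("gsm8k/Llama-2-70b-chat-hf/run_7472_outputs.pkl",
#                 "gsm8k/Llama-2-70b-chat-hf/run_7472_predictions.npy")
```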
Could you please clarify the reasoning behind these different sample counts and file structures? Specifically:
- For Llama-2-70b, should I be using the 7472 files and ignoring the 2085 ones?
- For Llama-2-7b and Gemma-7b, do the files (4000 and 6980) represent the complete intended training set for this project, or are they partial checkpoints/subsets?
Any guidance on which files are the correct ones to use for reproduction would be greatly appreciated.
Thanks!