Hi there,
Thank you for your impressive work on this project.
While exploring the dataset structure, I noticed some inconsistencies in the training data files for the GSM8K benchmark across different models. While most models seem to align with the standard 7,472 samples (usually consisting of 2 files), I found the following discrepancies in specific directories:
- gsm8k/Llama-2-70b-chat-hf: This directory contains 4 files instead of the expected 2. It includes outputs and predictions for both "2085" and "7472":
  - run_2085_outputs.pkl / run_2085_predictions.npy
  - run_7472_outputs.pkl / run_7472_predictions.npy
- gsm8k/Llama-2-7b-chat-hf/train: The files here correspond to a count of 4,000 rather than the full set:
  - run_4000_outputs.pkl
  - run_4000_predictions.npy
- gsm8k/gemma-7b/train: The files here correspond to a count of 6,980:
  - run_6980_outputs.pkl
  - run_6980_predictions.npy
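For reference, I inferred these counts from the filenames; a small script like the one below could confirm them by loading each file pair directly. This is just a sketch, assuming the .pkl files hold a pickled sequence and the .npy files a NumPy array (based purely on the extensions):

```python
import pickle
import numpy as np

def check_run_files(outputs_path, predictions_path):
    """Load one run's outputs/predictions pair and report their lengths."""
    with open(outputs_path, "rb") as f:
        outputs = pickle.load(f)
    predictions = np.load(predictions_path, allow_pickle=True)
    print(f"{outputs_path}: {len(outputs)} outputs")
    print(f"{predictions_path}: {len(predictions)} predictions")
    return len(outputs), len(predictions)

# Example (paths taken from the directories listed above):
# check_run_files("gsm8k/Llama-2-70b-chat-hf/run_7472_outputs.pkl",
#                 "gsm8k/Llama-2-70b-chat-hf/run_7472_predictions.npy")
```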
Could you please clarify the reasoning behind these different sample counts and file structures? Specifically:
- For Llama-2-70b, should I be using the 7472 files and ignoring the 2085 ones?
- For Llama-2-7b and Gemma-7b, do the files (4000 and 6980) represent the complete intended training set for this project, or are they partial checkpoints/subsets?
Any guidance on which files are the correct ones to use for reproduction would be greatly appreciated.
Thanks!