This project aims to evaluate the performance of different large language models through a series of standardized tests. Using a Python program executed in IPython, we assess the models' ability to solve problems from "Problem Set 4," experimenting with different temperature settings to observe their impact on the generated results.
- Language: Python
- Environment: IPython
- Libraries:
  - `csv`: For handling CSV file operations.
  - `os`: For interacting with the operating system.
  - `time`: For time-related operations.
  - `subprocess`: For running the generated code under controlled conditions.
  - `openai`: For integration with OpenAI models.
  - `pathlib`: For working with file paths.
  - `langchain_google_genai`: For integration with Google Generative AI models.
  - `langchain.prompts.ChatPromptTemplate`: For creating prompts for the models.
  - `langchain.output_parsers`: For parsing model responses.
  - `langchain.chat_models.ChatOpenAI`: For using OpenAI's chat models.
The code is available on GitHub here. All necessary documentation is included for easy review.
- Models:
- Gemini 1.0 Pro
- Gemini 1.5 Pro
- Gemini 1.5 Flash
- GPT-3.5 Turbo
- GPT-4 Turbo
- GPT-4o
- GPT-4o mini
- Temperatures tested:
- Temperature = 0
- Temperature = 1
- Additional parameters:
  - `top_p` = 1
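The full configuration grid can be sketched as follows (model identifiers and the dictionary structure are illustrative; the actual setup lives in the notebook):

```python
# Illustrative sketch of the evaluated configuration grid:
# 7 models x 2 temperatures, with top_p fixed at 1.
from itertools import product

MODELS = [
    "gemini-1.0-pro", "gemini-1.5-pro", "gemini-1.5-flash",
    "gpt-3.5-turbo", "gpt-4-turbo", "gpt-4o", "gpt-4o-mini",
]
TEMPERATURES = [0, 1]
TOP_P = 1

# One run configuration per (model, temperature) pair.
runs = [
    {"model": m, "temperature": t, "top_p": TOP_P}
    for m, t in product(MODELS, TEMPERATURES)
]
# 7 models x 2 temperatures = 14 configurations in total
```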
- Prompt: Solve problems from "Problem Set 4", with each generated solution printing "True" when its tests pass and "False" otherwise.
- Problem Selection: The first 25 problems from "Problem Set 4".
- Execution:
- The prompt was sent to the models with the indicated configuration.
- The model's response was extracted from the resulting JSON and stored in a Python file.
- The Python file was executed using `subprocess` with a 60-second time limit.
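The execution step can be sketched roughly as below: run the generated solution file in a subprocess and capture its output, treating runs that exceed the 60-second limit as failures. The function name and return convention are assumptions for illustration.

```python
# Sketch: execute a generated solution file with a 60-second time limit.
import subprocess
import sys

def run_solution(path: str, timeout: int = 60) -> str:
    """Run a generated Python file; return its stdout, or 'timeout'."""
    try:
        result = subprocess.run(
            [sys.executable, path],      # run with the current interpreter
            capture_output=True,         # collect stdout/stderr
            text=True,                   # decode output as str
            timeout=timeout,             # kill the process after `timeout` s
        )
        return result.stdout.strip()
    except subprocess.TimeoutExpired:
        return "timeout"
```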
- Result Analysis:
- Counting "True" and "False" responses.
- Results were stored in a CSV file along with the execution code.
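The analysis step above can be sketched as follows; the column layout and function name are assumptions, not the notebook's actual schema:

```python
# Sketch: tally "True"/"False" outcomes and write them to a CSV file.
import csv
from collections import Counter

def save_results(outcomes, csv_path):
    """outcomes: list of (problem_id, output_string) pairs."""
    counts = Counter(out for _, out in outcomes)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["problem_id", "result"])   # per-problem rows
        for pid, out in outcomes:
            writer.writerow([pid, out])
        writer.writerow(["total_true", counts.get("True", 0)])
        writer.writerow(["total_false", counts.get("False", 0)])
    return counts
```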
The primary notebook responsible for executing the program is located at `/Transformers/test_dataset/process_data/Solve_Extract_V2.ipynb`. To run the program, navigate to that directory and execute the notebook.
- Results CSV: Contains the count of "True" and "False" responses along with execution details.
- Python Code: Contains the code used for the evaluation.