Locally run code similarity detection for Python files and Jupyter Notebooks
Runs pairwise comparisons between Python code from .py and .ipynb files to determine:
- Code similarity: Compares code contents after substituting variable names and applying other normalizations. Uses a windowed fingerprinting approach (winnowing) to build vector representations of the submissions, which are then compared using cosine similarity. Based primarily on http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf.
- Output similarity: Compares .ipynb output cells (again using cosine similarity).
- Variable similarity: Compares the sets of all variable names, including function and class names, using Jaccard similarity.
- Structure similarity: Compares the abstract syntax trees (ASTs) for different submissions using Jaccard similarity.
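The code-similarity idea can be sketched roughly as follows. This is a minimal illustration, not CoSi's actual implementation: the function names, the k-gram length `k=5`, the window size `window=4`, and the normalization (every non-keyword identifier becomes `v`, comments and layout are dropped) are all assumptions chosen for the example.

```python
import hashlib
import io
import keyword
import math
import tokenize
from collections import Counter


def normalize(source):
    """Drop comments and layout tokens; replace every non-keyword name with 'v'."""
    skipped = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
               tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    parts = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skipped:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            parts.append("v")  # simplification: builtins are renamed too
        else:
            parts.append(tok.string)
    return "".join(parts)


def kgram_hashes(text, k=5):
    """Hash every k-character substring of the normalized text."""
    return [
        int(hashlib.md5(text[i:i + k].encode("utf-8")).hexdigest()[:8], 16)
        for i in range(len(text) - k + 1)
    ]


def winnow(hashes, window=4):
    """Keep the minimum hash of each sliding window (rightmost on ties)."""
    fingerprints = Counter()
    last_pos = -1
    for start in range(len(hashes) - window + 1):
        chunk = hashes[start:start + window]
        smallest = min(chunk)
        # Absolute position of the rightmost minimum in this window.
        pos = start + (window - 1 - chunk[::-1].index(smallest))
        if pos != last_pos:  # count each selected position only once
            fingerprints[smallest] += 1
            last_pos = pos
    return fingerprints


def fingerprint(source, k=5, window=4):
    """Sparse fingerprint vector (hash -> count) for one source string."""
    return winnow(kgram_hashes(normalize(source), k), window)


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(count * b[h] for h, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under this sketch, two functions that differ only in their identifier names normalize to the same text and therefore get identical fingerprints. Inputs shorter than `k + window - 1` normalized characters produce an empty fingerprint and score 0.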
Example output:
{
"repoA": "https://github.com/cs544-wisc/project-6-repo-A",
"repoB": "https://github.com/cs544-wisc/project-6-repo-B",
"code": 0.0258,
"variables": 0.357,
"output": 0.9634,
"struture": 0.7879,
"visualize": "python3 display.py 'project-6-repo-A' 'project-6-repo-B'"
}
Note: The code within the files must be syntactically valid since CoSi uses AST parsing to extract code structure.
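To illustrate why valid syntax matters, here is a hedged sketch of AST-based comparison using Python's standard `ast` module. The helper names and the choice of features (a set of identifier names, a set of AST node type names) are assumptions made for this example; CoSi may extract different features. Comparing node type sets ignores how nodes are nested, so it is only a rough stand-in for a structural comparison.

```python
import ast


def parse_or_none(source):
    """Return the AST, or None when the source is not valid Python."""
    try:
        return ast.parse(source)
    except SyntaxError:
        return None


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def identifier_names(tree):
    """Collect variable, argument, function, and class names."""
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(node.name)
    return names


def node_types(tree):
    """Collect the names of all AST node types that occur in the tree."""
    return {type(node).__name__ for node in ast.walk(tree)}


def compare(source_a, source_b):
    """Return name and node-type Jaccard scores, or None if either file fails to parse."""
    tree_a, tree_b = parse_or_none(source_a), parse_or_none(source_b)
    if tree_a is None or tree_b is None:
        return None
    return {
        "variables": jaccard(identifier_names(tree_a), identifier_names(tree_b)),
        "structure": jaccard(node_types(tree_a), node_types(tree_b)),
    }
```

A file with a syntax error cannot be parsed at all, so no name or structure features can be extracted from it.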
- Add a new directory (e.g. project-10) within ./files/submissions containing individual student submission directories
- Update the .env file; set SUBMISSIONS_ROOT_DIR as the folder name (project-10/) and FILES_IN_SUBMISSION as the list of files to be read from each sub-directory of SUBMISSIONS_ROOT_DIR, e.g. p10.py or nb/client.py,nb/server.py. Also, set GITHUB_PREFIX as the URL prefix to add to the directory name (used in the resulting CSV), or leave it blank if not applicable
- Run python3 CoSi.py
- The result is saved as a CSV in ./files/results, identified by the time of generation. A FAILED CSV is also included if parsing of any files fails
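Before running CoSi, it can help to confirm that every submission directory actually contains the files listed in FILES_IN_SUBMISSION. The helper below is a hypothetical pre-flight check, not part of CoSi; it assumes the file list is comma-separated, as the nb/client.py,nb/server.py example suggests.

```python
from pathlib import Path


def missing_files(submissions_root, files_in_submission):
    """Return {submission directory name: [expected relative paths that are absent]}."""
    # Assumption: FILES_IN_SUBMISSION is a comma-separated list of relative paths.
    relative = [name.strip() for name in files_in_submission.split(",") if name.strip()]
    missing = {}
    for sub in sorted(Path(submissions_root).iterdir()):
        if not sub.is_dir():
            continue
        absent = [rel for rel in relative if not (sub / rel).is_file()]
        if absent:
            missing[sub.name] = absent
    return missing


# Example call (paths are illustrative):
# missing_files("./files/submissions/project-10", "nb/client.py,nb/server.py")
```

An empty dictionary means every submission directory has all the expected files.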
The result csv contains a column to view differences between the code in two submissions. In general, to visualize the text-diff of two submissions, set the SUBMISSIONS_ROOT_DIR and FILES_IN_SUBMISSION fields in .env. Then execute python3 display.py 'repoA' 'repoB', where repoA and repoB are student directories within the SUBMISSIONS_ROOT_DIR. This will open a browser window with a side-by-side comparison of all files from FILES_IN_SUBMISSION in the two repos.
Note: If repoA and repoB are present in SUBMISSIONS_ROOT_DIR, the saved version of the files will be displayed. If any of the repo names passed is not saved locally, it will be cloned from git into SUBMISSIONS_ROOT_DIR first.
Note: The script uses the diffchecker.com public API to retrieve the diff HTML (https://api.diffchecker.com/public/text). Use of the Diffchecker API must abide by their API terms.
An example submission is provided for which CoSi may be executed immediately.
- To run CoSi for the example, execute python3 CoSi.py
- To view the results, see the CSV file generated in results/
- To visualize the diff between the example submission pair, run python3 display.py 'project-6-repo-B' 'project-6-repo-A'