This repository hosts the code for recreating the analyses in the PanOryza manuscript. The code for GET_PANGENES is available from: https://github.com/Ensembl/plant-scripts/blob/master/pangenes/. The code for Nipponbare merged genes is available from: https://github.com/Ensembl/plant-scripts/tree/master/scripts. The input files (.fasta and .gff format) for running GET_PANGENES are available from Zenodo (https://zenodo.org/records/14772953). Alternatively, the output files for Os4530.POR.1 (version 1.0) are also available in the Zenodo repository and can be used for downstream analyses of the pan-genes with the code available here.
To reproduce the entire analysis starting from the GET_PANGENES results, first prepare the tables and intermediate files described below; they are used to recreate the manuscript figures.
Running get_pangenes with RPRP (MAGIC-16 accessions) as input produces the following set of files:
- .cluster_list --> parsed into tabular format using the function parse_clusters --> output table named "df_merged"
- .matrix_genes.tr.tab --> read directly as a table named "pangene_list"
- .matrix.tr.tab
- Individual clusters inside the folder 'oryzasativanipponbaremerged' --> the *.cds.faa files of the clusters are used to calculate and summarise cluster and individual protein lengths. The cluster sequence summary can be created in R using create_cluster_sum. NOTE: There are also several ways to do this in a Linux terminal. The resulting cluster sequence summary can be further parsed into a dataframe using read_parse_clusters_summary.
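
As an illustration of the Linux-terminal route mentioned in the note above, here is a minimal sketch that summarises protein lengths per cluster file with awk. It is not the repository's create_cluster_sum: the function name summarise_cluster_lengths and the output columns are assumptions made for this example, and the output may need reshaping before it can be parsed with read_parse_clusters_summary. It assumes standard FASTA input, in which sequences may wrap over several lines and may end with a '*' stop symbol.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): summarise protein
# lengths for every *.cds.faa cluster file in a directory.
# Usage: summarise_cluster_lengths <cluster_dir>
# Prints one tab-separated row per file:
#   cluster_file  n_sequences  min_length  max_length  mean_length
summarise_cluster_lengths() {
    local dir="$1"
    printf 'cluster_file\tn_sequences\tmin_length\tmax_length\tmean_length\n'
    for f in "$dir"/*.cds.faa; do
        [ -e "$f" ] || continue   # skip when the glob matched nothing
        awk -v name="$(basename "$f")" '
            # Record the length of the sequence that just ended, if any.
            function flush_seq() {
                if (!in_seq) return
                n++; total += len
                if (n == 1 || len < min) min = len
                if (n == 1 || len > max) max = len
            }
            # A header line closes the previous record and starts a new one.
            /^>/   { flush_seq(); in_seq = 1; len = 0; next }
            # Sequence line: drop whitespace and stop symbols, then count.
            in_seq { gsub(/[ \t\r*]/, ""); len += length($0) }
            END {
                flush_seq()
                if (n > 0)
                    printf "%s\t%d\t%d\t%d\t%.1f\n", name, n, min, max, total / n
            }
        ' "$f"
    done
}
```

After sourcing the function, it could be called as `summarise_cluster_lengths oryzasativanipponbaremerged > clusters_length_summary.tsv`, where the output file name is only a placeholder.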
An additional table named "cluster_merged", created by combining "pangene_list" and "df_merged", is used in various places.
InterProScan tabular results for the magic18 protein sequences were merged with the cluster files above. We recommend loading the workspace core_workspace.RData (available at Zenodo) in R/RStudio, which also loads these InterProScan results for the pan-genes. Alternatively, core_files.R can be used to read all the files needed for downstream analysis.
To reproduce the figure-wise analyses, please refer to the scripts folder.