Graph Mining Project
Institution: Isfahan University of Technology
Instructor: Dr. Zeinab Maleki
Date: February 2026
Predicting essential proteins, those indispensable for an organism's survival, is a fundamental challenge in systems biology. This project uses the Saccharomyces cerevisiae (yeast) protein-protein interaction (PPI) network to compare six classical graph centrality measures with five Graph Neural Network (GNN) architectures.
By integrating topology-aware features and employing a cluster-aware deep learning pipeline, we demonstrate that GNNs significantly outperform classical heuristics. Specifically, Weighted GatedGCN achieves the highest global performance (Test AUC: 0.802). Furthermore, using Spectral Clustering and VGAE-based community detection, we reveal that architecture suitability is context-dependent: GIN excels in sparse, structural complexes ("Metabolic Factories"), while GCN dominates in dense, homophilic modules ("Protein Biogenesis").
- Source: STRING v12.0 PPI network (Saccharomyces cerevisiae).
- Filtering: High-confidence interactions only (Combined Score $\ge 700$).
- Labels: Essentiality data from `pcbi.1008730.s008.xlsx` (17.27% essential proteins).
- Graph Stats: 857 nodes, 8,356 edges (after removing isolates).
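
The filtering step above can be sketched without any dependencies. This is a minimal illustration, not the notebook's code: the rows, protein identifiers, and the `build_graph` helper are hypothetical, and the real STRING file would be read from disk rather than from an in-memory list.

```python
# Toy stand-in for rows of the STRING links file:
# (protein1, protein2, combined_score). All values are made up.
rows = [
    ("P1", "P2", 950),
    ("P2", "P3", 700),
    ("P3", "P4", 699),  # below the cutoff, dropped
    ("P1", "P3", 810),
    ("P5", "P6", 150),  # below the cutoff, dropped
]

MIN_SCORE = 700  # high-confidence cutoff stated in the list above


def build_graph(rows, min_score=MIN_SCORE):
    """Return an undirected adjacency dict with edges scoring >= min_score."""
    adj = {}
    for a, b, score in rows:
        if score < min_score or a == b:
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


adj = build_graph(rows)
n_nodes = len(adj)
n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
print(n_nodes, n_edges)
```

Because a node is only created when one of its edges survives the cutoff, isolates never enter the graph, which matches the "after removing isolates" note.
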
For GNN inputs, purely topological node features were extracted and standardized:
- Degree Centrality: (Count-based)
- Clustering Coefficient: (Local connectivity)
- Core Number: (k-core decomposition index)
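
As a rough illustration of these three features, the sketch below computes them on a hypothetical four-node graph and applies z-score standardization. It is written in plain Python for readability; the project itself reports using NetworkX, and the exact standardization used in the notebook is an assumption here.

```python
from statistics import mean, pstdev

# Hypothetical toy graph: a triangle (a, b, c) with a pendant node d.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}


def degree_centrality(adj):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {v: len(nb) / (n - 1) for v, nb in adj.items()}


def clustering_coefficient(adj):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    out = {}
    for v, nb in adj.items():
        k = len(nb)
        if k < 2:
            out[v] = 0.0
            continue
        links = sum(len(adj[u] & nb) for u in nb) / 2  # each edge seen twice
        out[v] = 2 * links / (k * (k - 1))
    return out


def core_number(adj):
    """k-core index, found by repeatedly peeling the minimum-degree node."""
    deg = {v: len(nb) for v, nb in adj.items()}
    remaining, core, k = set(adj), {}, 0
    while remaining:
        v = min(remaining, key=lambda x: deg[x])
        k = max(k, deg[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1
    return core


def standardize(values):
    """Z-score a {node: value} mapping (zero mean, unit variance)."""
    xs = list(values.values())
    mu, sigma = mean(xs), pstdev(xs)
    return {v: (x - mu) / sigma if sigma > 0 else 0.0 for v, x in values.items()}


deg_c = degree_centrality(adj)
clust = clustering_coefficient(adj)
core = core_number(adj)
raw = [deg_c, clust, core]
features = {v: [standardize(f)[v] for f in raw] for v in adj}
```

Each node ends up with a three-dimensional feature vector, one standardized value per measure.
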
We implemented the following measures using NetworkX and custom algorithms:
- Degree Centrality
- PageRank (Unweighted & Weighted)
- Eigenvector Centrality
- Closeness Centrality
- Betweenness Centrality (Approximate, $k=100$)
- Random Walk with Restart (RWR): Captures local and global topology ($r=0.3$).
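
To make the RWR measure concrete, here is a minimal power-iteration sketch. It assumes that $r$ is the restart probability and that the walker restarts uniformly over a seed set; the README does not say how seeds were chosen, so the seed set, the toy graph, and the `rwr` helper are illustrative assumptions.

```python
def rwr(adj, seeds, r=0.3, tol=1e-10, max_iter=1000):
    """Random Walk with Restart by power iteration.

    Update rule: p <- (1 - r) * W p + r * e, where W spreads each node's
    probability evenly over its neighbors and e is uniform over `seeds`.
    Assumes every node has at least one neighbor (no isolates).
    """
    nodes = list(adj)
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(restart)
    for _ in range(max_iter):
        nxt = {v: r * restart[v] for v in nodes}
        for u in nodes:
            share = (1.0 - r) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical toy graph: triangle a-b-c plus a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
scores = rwr(adj, seeds={"a"}, r=0.3)
ranking = sorted(scores, key=scores.get, reverse=True)
```

The scores form a probability distribution over nodes, so they can be used directly as a ranking: nodes near the seeds score high, while the restart term keeps distant nodes from being ignored entirely.
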
The five GNN architectures were implemented in PyTorch Geometric (PyG) with a unified training pipeline (Adam optimizer, class-weighted cross-entropy loss):
- GCN: Spectral smoothing, best for dense/homophilic regions.
- GraphSAGE: Inductive; concatenates each node's own representation with an aggregate of its neighbors' features.
- GATv2: Attention-based aggregation (dynamic weighting).
- GIN: Isomorphism-based, sum aggregation; ideal for sparse structures.
- GatedGCN (Weighted): Uses edge gating to filter noisy interactions.
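
With only 17.27% of proteins labeled essential, the class weighting in the loss matters. The sketch below shows one common choice, inverse-frequency ("balanced") weights, together with a weighted cross-entropy that uses a weighted mean over samples. The weighting formula is an assumption; the README does not state which weights the notebook uses.

```python
import math


def balanced_class_weights(labels, n_classes=2):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n = len(labels)
    counts = [labels.count(c) for c in range(n_classes)]
    return [n / (n_classes * cnt) for cnt in counts]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-sample cross-entropy computed from raw logits."""
    num, den = 0.0, 0.0
    for z, y in zip(logits, labels):
        m = max(z)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        num += weights[y] * (log_norm - z[y])
        den += weights[y]
    return num / den


# Hypothetical toy labels: 1 essential protein out of 5.
labels = [0, 0, 0, 0, 1]
weights = balanced_class_weights(labels)

# Uninformative logits give log(2) regardless of the weights.
logits = [[0.0, 0.0] for _ in labels]
loss = weighted_cross_entropy(logits, labels, weights)
```

Under this scheme the minority class receives the larger weight, so misclassifying an essential protein costs more than misclassifying a non-essential one.
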
Two clustering strategies were employed to analyze local model performance:
- Spectral Clustering: Topology-based partitioning via Normalized Laplacian.
- VGAE + Louvain: Functional communities derived from Variational Graph Autoencoder embeddings refined by Louvain optimization.
| Model Category | Method | Best Metric | Interpretation |
|---|---|---|---|
| Classical | Betweenness Centrality | AUC | Global "bridging" is more informative than local degree. |
| Classical | Precision@100 | 0.26 | Only ~26 of the top 100 ranked proteins are actually essential. |
| GNN | Weighted GatedGCN | AUC (0.802 on the test set) | Highest global performance in this comparison, achieved by leveraging edge weights. |
| GNN | GCN | High Recall | Performs well globally due to general network homophily. |
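
The two ranking metrics in the table can be computed as follows. This is a minimal sketch on made-up scores and labels; the function names are hypothetical and the notebook may rely on scikit-learn instead.

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scoring nodes that are labeled essential."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(labels[v] for v in ranked) / k


def roc_auc(scores, labels):
    """Probability that a random essential node outranks a random
    non-essential one (ties count as one half)."""
    pos = [scores[v] for v in scores if labels[v] == 1]
    neg = [scores[v] for v in scores if labels[v] == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical centrality scores and essentiality labels.
scores = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.3, "e": 0.1}
labels = {"a": 1, "b": 0, "c": 1, "d": 0, "e": 0}

p_at_2 = precision_at_k(scores, labels, k=2)
auc = roc_auc(scores, labels)
```

Read this way, a Precision@100 of 0.26 means 26 of the 100 top-ranked proteins are essential, compared with the 17.27% base rate in the dataset.
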
Our spectral analysis revealed distinct structural preferences:
- "Metabolic Factory" Clusters (Sparse):
  - Winner: GIN
  - Reason: Sparse structures require distinguishing graph isomorphisms and counting neighbors, which GIN's sum aggregator handles best.
- "Protein Biogenesis" / "Lipid Refinery" (Dense):
  - Winner: GCN
  - Reason: Dense, homophilic regions benefit from the smoothing effect of GCN's degree-normalized neighborhood averaging.
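
The difference between the two aggregators can be seen in a toy example. The sketch below is a deliberate simplification: real GIN layers add a self-term and an MLP, and real GCN layers use symmetric degree normalization rather than a plain mean. The feature vectors are made up.

```python
def aggregate(neighbor_feats, how):
    """Combine neighbor feature vectors by element-wise sum or mean."""
    dim = len(neighbor_feats[0])
    totals = [sum(f[i] for f in neighbor_feats) for i in range(dim)]
    if how == "sum":
        return totals
    if how == "mean":
        return [t / len(neighbor_feats) for t in totals]
    raise ValueError(f"unknown aggregator: {how}")


# Two hypothetical nodes whose neighbors all share the same features,
# but one has 2 neighbors and the other has 4.
small = [[1.0, 0.5]] * 2
large = [[1.0, 0.5]] * 4

mean_small, mean_large = aggregate(small, "mean"), aggregate(large, "mean")
sum_small, sum_large = aggregate(small, "sum"), aggregate(large, "sum")
```

Mean aggregation returns the same vector for both nodes, which smooths features across a homophilic neighborhood but discards neighbor counts. Sum aggregation keeps the two nodes distinguishable, which is the property the GIN explanation above relies on.
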
Ensure you have the following libraries installed:
pip install numpy pandas networkx scikit-learn matplotlib seaborn torch torch-geometric
- Clone the repository.
- Place the dataset files (`4932.protein.links.v12.0.csv` and `pcbi.1008730.s008.xlsx`) in the working directory (or update paths in the notebook).
- Open `codes.ipynb` in Jupyter Lab or Google Colab.
- Run all cells to reproduce:
  - Data preprocessing and graph construction.
  - Classical centrality calculation and threshold optimization.
  - GNN training (5 architectures).
  - Clustering analysis and visualizations.
- Zahra Tavakoli - zahratavakoli763@gmail.com
- Mahsa Tavassoli - mtk.mahsa04@gmail.com
Data Sources:
- STRING v12.0: https://string-db.org/
- Essentiality Labels: Supplementary data from pcbi.1008730.