Comparison of Classical Graph Measures and Graph Neural Networks for Essential Protein Prediction

Graph Mining Project

Institution: Isfahan University of Technology

Instructor: Dr. Zeinab Maleki

Date: February 2026

📌 Abstract

Predicting essential proteins, those indispensable for an organism's survival, is a fundamental challenge in systems biology. This project uses the Saccharomyces cerevisiae (yeast) protein-protein interaction (PPI) network to compare six classical graph centrality measures with five Graph Neural Network (GNN) architectures.

By integrating topology-aware features and employing a cluster-aware deep learning pipeline, we show that GNNs substantially outperform classical heuristics: Weighted GatedGCN achieves the highest global performance (test AUC 0.802), against an AUC of about 0.597 for the best classical measure, Betweenness Centrality. Furthermore, using Spectral Clustering and VGAE-based community detection, we find that architecture suitability is context-dependent: GIN excels in sparse, structural complexes ("Metabolic Factories"), while GCN dominates in dense, homophilic modules ("Protein Biogenesis").

🛠 Methodology

1. Dataset Construction

Source: STRING v12.0 PPI network (Saccharomyces cerevisiae).
Filtering: High-confidence interactions only (Combined Score $\ge 700$).
Labels: Essentiality data from pcbi.1008730.s008.xlsx (17.27% essential proteins).
Graph Stats: 857 nodes, 8,356 edges (after removing isolates).
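The filtering step can be sketched without any dependencies. This is a minimal illustration, not the notebook's code: it assumes the links file is comma-separated with the standard STRING column names `protein1`, `protein2`, and `combined_score`, and the sample rows and protein IDs are made up.

```python
import csv
import io
from collections import defaultdict


def build_ppi_graph(rows, min_score=700):
    """Return an undirected adjacency map {protein: {neighbor: score}}.

    Only edges with combined_score >= min_score are kept, so a protein with
    no surviving edge never enters the map (isolates are dropped implicitly).
    """
    adj = defaultdict(dict)
    for row in rows:
        u, v = row["protein1"], row["protein2"]
        score = int(row["combined_score"])
        if u == v or score < min_score:
            continue
        # STRING lists each pair in both directions; storing both keeps
        # the map symmetric either way.
        adj[u][v] = score
        adj[v][u] = score
    return dict(adj)


# Made-up sample rows in the assumed layout.
SAMPLE = """protein1,protein2,combined_score
4932.A,4932.B,950
4932.B,4932.A,950
4932.B,4932.C,700
4932.C,4932.D,420
"""

graph = build_ppi_graph(csv.DictReader(io.StringIO(SAMPLE)))
n_nodes = len(graph)
n_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
print(n_nodes, n_edges)  # 3 2
```

To run it on the real file, pass `csv.DictReader(open(path))` instead of the in-memory sample, adjusting the delimiter if the file is space-separated.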

2. Feature Engineering

For GNN inputs, purely topological node features were extracted and standardized:

Degree Centrality (count-based)
Clustering Coefficient (local connectivity)
Core Number (k-core decomposition index)
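As a rough illustration of what these three features measure, the sketch below computes them in plain Python on a toy graph and z-scores each column. The function names and the toy graph are hypothetical; the notebook itself relies on NetworkX, and the z-score is only one plausible reading of "standardized".

```python
from statistics import mean, pstdev


def degree_centrality(adj):
    """Degree divided by (n - 1), the normalization NetworkX uses."""
    n = len(adj)
    return {u: len(nbrs) / (n - 1) for u, nbrs in adj.items()}


def clustering_coefficient(adj):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    out = {}
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            out[u] = 0.0
            continue
        ordered = list(nbrs)
        links = sum(
            1
            for i, a in enumerate(ordered)
            for b in ordered[i + 1:]
            if b in adj[a]
        )
        out[u] = 2 * links / (k * (k - 1))
    return out


def core_number(adj):
    """k-core index by repeatedly peeling the node of minimum remaining degree."""
    deg = {u: len(nbrs) for u, nbrs in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        u = min(remaining, key=deg.get)
        k = max(k, deg[u])
        core[u] = k
        remaining.remove(u)
        for v in adj[u]:
            if v in remaining:
                deg[v] -= 1
    return core


def standardize(values):
    """Z-score a column; a constant column maps to zeros."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma if sigma else 0.0 for x in values]


# Toy graph: triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
nodes = sorted(adj)
columns = [degree_centrality(adj), clustering_coefficient(adj), core_number(adj)]
features = list(zip(*(standardize([col[u] for u in nodes]) for col in columns)))
print(core_number(adj))  # a, b, c have core number 2; d has core number 1
```

Each row of `features` is one node's three-dimensional input vector, in the order of `nodes`.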

3. Classical Centrality Measures

We implemented the following measures using NetworkX and custom algorithms:

Degree Centrality
PageRank (Unweighted & Weighted)
Eigenvector Centrality
Closeness Centrality
Betweenness Centrality (Approximate, $k=100$)
Random Walk with Restart (RWR): Captures both local and global topology (restart probability $r=0.3$).
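For readers unfamiliar with RWR, a minimal power-iteration sketch follows. It assumes that $r$ is the restart probability and that the walk restarts at a set of seed nodes; the seed choice, function name, and toy path graph are illustrative assumptions, not taken from the notebook.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walk that jumps back to the
    seed set with probability `restart` at every step.

    Assumes every node has at least one neighbor (isolates already removed).
    """
    nodes = list(adj)
    start = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    p = dict(start)
    for _ in range(max_iter):
        nxt = {u: restart * start[u] for u in nodes}
        for u in nodes:
            # Spread the non-restarting mass evenly over u's neighbors.
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy path graph a - b - c - d with the walk restarting at a.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
rwr = random_walk_with_restart(path, seeds={"a"}, restart=0.3)
print({u: round(s, 3) for u, s in rwr.items()})
```

Scores decay with distance from the seed, which is how the measure mixes local proximity with global reachability.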

4. Graph Neural Network Architectures

Implemented using PyTorch Geometric (PyG) with a unified training pipeline (Adam optimizer, Class-weighted CrossEntropy):

GCN: Spectral smoothing, best for dense/homophilic regions.
GraphSAGE: Inductive; combines each node's own representation with an aggregate of its neighbors' features.
GATv2: Attention-based aggregation (dynamic weighting).
GIN: Isomorphism-based, sum aggregation; ideal for sparse structures.
GatedGCN (Weighted): Uses edge gating to filter noisy interactions.
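The class-weighted loss matters because essential proteins are the minority class. The sketch below shows one common weighting scheme, inverse class frequency, together with the weighted-mean cross-entropy that PyTorch's `CrossEntropyLoss` computes when given a `weight` tensor. The weighting formula and the 148/709 class split (17.27% of 857 nodes) are assumptions for illustration; the notebook may use different weights.

```python
import math


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-sample cross-entropy, normalized by the sum of
    the weights of the samples' true classes."""
    total, norm = 0.0, 0.0
    for z, y in zip(logits, labels):
        m = max(z)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        total += weights[y] * (log_norm - z[y])
        norm += weights[y]
    return total / norm


# Assumed split: 148 essential (label 1) and 709 non-essential (label 0).
labels = [1] * 148 + [0] * 709
weights = balanced_class_weights(labels)
print({c: round(w, 3) for c, w in weights.items()})  # {0: 0.604, 1: 2.895}

# An uninformative model (equal logits) scores log(2) regardless of weights.
uniform_loss = weighted_cross_entropy([[0.0, 0.0]] * len(labels), labels, weights)
```

With these weights, a misclassified essential protein costs roughly 4.8 times as much as a misclassified non-essential one.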

5. Cluster-Aware Analysis

Two clustering strategies were employed to analyze local model performance:

Spectral Clustering: Topology-based partitioning via Normalized Laplacian.
VGAE + Louvain: Functional communities derived from Variational Graph Autoencoder embeddings refined by Louvain optimization.
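Once every node has a cluster assignment, local performance can be measured by scoring each cluster separately. The following is a hedged sketch of that evaluation step only (it does not perform the clustering); the function names and toy data are hypothetical, and AUC is computed through its pairwise-ranking definition.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative
    (ties count one half). Returns None if a class is missing."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_cluster(cluster_ids, labels, scores):
    """Group nodes by cluster and compute one AUC per cluster."""
    groups = defaultdict(lambda: ([], []))
    for c, y, s in zip(cluster_ids, labels, scores):
        groups[c][0].append(y)
        groups[c][1].append(s)
    return {c: roc_auc(ys, ss) for c, (ys, ss) in groups.items()}


# Toy data: nine nodes in three clusters; cluster 2 has no essential protein.
clusters = [0, 0, 0, 0, 1, 1, 1, 2, 2]
essential = [1, 0, 1, 0, 1, 0, 0, 0, 0]
predicted = [0.9, 0.2, 0.4, 0.6, 0.3, 0.3, 0.1, 0.5, 0.7]
per_cluster = auc_by_cluster(clusters, essential, predicted)
print(per_cluster)  # {0: 0.75, 1: 0.75, 2: None}
```

Clusters containing only one class have no defined AUC, so they are reported as `None` rather than silently dropped.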

📊 Key Results

Global Performance Comparison

Model Category	Method	Best Metric	Interpretation
Classical	Betweenness Centrality	AUC $\approx$ 0.597	Global "bridging" is more informative than local degree.
Classical	Precision@100	0.26	Only ~26 of the top 100 ranked proteins are actually essential.
GNN	Weighted GatedGCN	AUC $\approx$ 0.802	Best performance in this study, obtained by leveraging edge weights.
GNN	GCN	High Recall	Performs well globally due to general network homophily.
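Precision@k, the ranking metric in the table, is simple to compute: rank proteins by score and count how many of the top k are essential. The sketch below uses made-up protein names and scores. For context, a Precision@100 of 0.26 is about 1.5 times the 17.27% base rate of essential proteins.

```python
def precision_at_k(scores, essential, k):
    """Fraction of the k highest-scoring proteins that are essential."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for protein in ranked if protein in essential) / k


# Hypothetical centrality scores and ground-truth essential set.
scores = {"P1": 0.9, "P2": 0.8, "P3": 0.5, "P4": 0.4, "P5": 0.1}
essential = {"P1", "P3"}

print(precision_at_k(scores, essential, k=2))  # 0.5
print(round(0.26 / 0.1727, 2))  # lift over the base rate: 1.51
```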

Cluster-Specific Findings

Our spectral analysis revealed distinct structural preferences:

"Metabolic Factory" Clusters (Sparse):
- Winner: GIN
- Reason: Sparse structures require distinguishing non-isomorphic neighborhoods and counting neighbors, which GIN's sum aggregator handles best.
"Protein Biogenesis" / "Lipid Refinery" (Dense):
- Winner: GCN
- Reason: Dense, homophilic regions benefit from the smoothing effect of GCN's degree-normalized (mean-like) aggregation.
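The difference between the two aggregators can be seen on a toy case. In the sketch below, two nodes have neighbors with identical features but different neighbor counts: averaging cannot tell them apart, while summing can. This is a simplification for illustration, since a real GCN layer uses symmetric degree normalization rather than a plain mean, and the feature values are made up.

```python
def aggregate(neighbor_features, how):
    """Combine neighbor feature vectors by element-wise sum or mean."""
    dim = len(neighbor_features[0])
    totals = [sum(f[d] for f in neighbor_features) for d in range(dim)]
    if how == "sum":
        return totals
    if how == "mean":
        return [t / len(neighbor_features) for t in totals]
    raise ValueError(f"unknown aggregator: {how}")


# Node u has two neighbors, node v has four; all neighbors share one feature.
u_neighbors = [[1.0], [1.0]]
v_neighbors = [[1.0], [1.0], [1.0], [1.0]]

print(aggregate(u_neighbors, "mean"), aggregate(v_neighbors, "mean"))  # [1.0] [1.0]
print(aggregate(u_neighbors, "sum"), aggregate(v_neighbors, "sum"))    # [2.0] [4.0]
```

Averaging is the smoothing behavior that helps in homophilic regions, whereas the sum preserves neighborhood size, which carries more signal when structures are sparse.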

🚀 Usage

Prerequisites

Ensure you have the following libraries installed:

pip install numpy pandas networkx scikit-learn matplotlib seaborn torch torch-geometric

Running the Code

Clone the repository.
Place the dataset files (4932.protein.links.v12.0.csv and pcbi.1008730.s008.xlsx) in the working directory (or update paths in the notebook).
Open codes.ipynb in Jupyter Lab or Google Colab.
Run all cells to reproduce:
- Data preprocessing and graph construction.
- Classical centrality calculation and threshold optimization.
- GNN training (5 architectures).
- Clustering analysis and visualizations.

Authors

Zahra Tavakoli - zahratavakoli763@gmail.com
Mahsa Tavassoli - mtk.mahsa04@gmail.com

Data Sources:

STRING v12.0: https://string-db.org/
Essentiality Labels: Supplementary data from pcbi.1008730.
