
Comparison of Classical Graph Measures and Graph Neural Networks for Essential Protein Prediction

Graph Mining Project

Institution: Isfahan University of Technology

Instructor: Dr. Zeinab Maleki

Date: February 2026

📌 Abstract

Predicting essential proteins—those indispensable for an organism's survival—is a fundamental challenge in systems biology. This project utilizes the Saccharomyces cerevisiae (Yeast) PPI network to perform a comparative analysis between six classical graph centrality measures and five Graph Neural Network (GNN) architectures.

By integrating topology-aware features and employing a cluster-aware deep learning pipeline, we demonstrate that GNNs significantly outperform classical heuristics. Specifically, Weighted GatedGCN achieves the highest global performance (Test AUC: 0.802). Furthermore, using Spectral Clustering and VGAE-based community detection, we reveal that architecture suitability is context-dependent: GIN excels in sparse, structural complexes ("Metabolic Factories"), while GCN dominates in dense, homophilic modules ("Protein Biogenesis").

🛠 Methodology

1. Dataset Construction

  • Source: STRING v12.0 PPI network (Saccharomyces cerevisiae).
  • Filtering: High-confidence interactions only (Combined Score $\ge 700$).
  • Labels: Essentiality data from pcbi.1008730.s008.xlsx (17.27% essential proteins).
  • Graph Stats: 857 nodes, 8,356 edges (after removing isolates).
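The preprocessing above can be sketched as follows. This is a minimal sketch, assuming STRING's standard space-delimited links format with columns `protein1`, `protein2`, `combined_score`; the delimiter may need adjusting if the file was converted to true comma-separated form:

```python
import networkx as nx
import pandas as pd

def build_yeast_graph(path, min_score=700):
    """Load STRING links, keep high-confidence edges, drop isolated nodes."""
    # STRING distributes links files space-delimited; adjust sep if needed
    links = pd.read_csv(path, sep=" ")
    links = links[links["combined_score"] >= min_score]
    G = nx.from_pandas_edgelist(links, "protein1", "protein2",
                                edge_attr="combined_score")
    G.remove_nodes_from(list(nx.isolates(G)))
    return G

# G = build_yeast_graph("4932.protein.links.v12.0.csv")
```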

2. Feature Engineering

For GNN inputs, purely topological node features were extracted and standardized:

  • Degree Centrality: (Count-based)
  • Clustering Coefficient: (Local connectivity)
  • Core Number: (k-core decomposition index)
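The three features above can be computed with NetworkX and standardized with scikit-learn. A minimal sketch; z-score standardization is assumed here, the notebook may scale differently:

```python
import networkx as nx
import numpy as np
from sklearn.preprocessing import StandardScaler

def topological_features(G):
    """Stack degree centrality, clustering coefficient, and core number per node."""
    nodes = list(G.nodes)
    deg = nx.degree_centrality(G)
    clust = nx.clustering(G)
    core = nx.core_number(G)
    X = np.array([[deg[n], clust[n], core[n]] for n in nodes], dtype=float)
    # Standardize each feature to zero mean and unit variance
    return StandardScaler().fit_transform(X)
```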

3. Classical Centrality Measures

We implemented the following measures using NetworkX and custom algorithms:

  • Degree Centrality
  • PageRank (Unweighted & Weighted)
  • Eigenvector Centrality
  • Closeness Centrality
  • Betweenness Centrality (Approximate, $k=100$)
  • Random Walk with Restart (RWR): Captures local and global topology ($r=0.3$).
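The RWR measure can be sketched via the standard equivalence with personalized PageRank (restart probability $r$ corresponds to damping factor $\alpha = 1 - r$). The `seeds` parameter here is an illustrative assumption about how the walk is seeded; the notebook's exact seeding may differ:

```python
import networkx as nx

def rwr_scores(G, seeds, r=0.3):
    """Random Walk with Restart via personalized PageRank.

    Restart probability r maps to PageRank damping alpha = 1 - r.
    `seeds` is the (assumed) restart set; scores sum to 1 over all nodes.
    """
    personalization = {n: (1.0 if n in seeds else 0.0) for n in G.nodes}
    return nx.pagerank(G, alpha=1 - r, personalization=personalization)
```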

4. Graph Neural Network Architectures

Implemented using PyTorch Geometric (PyG) with a unified training pipeline (Adam optimizer, Class-weighted CrossEntropy):

  • GCN: Spectral smoothing, best for dense/homophilic regions.
  • GraphSAGE: Inductive concatenation of neighbor features.
  • GATv2: Attention-based aggregation (dynamic weighting).
  • GIN: Isomorphism-based, sum aggregation; ideal for sparse structures.
  • GatedGCN (Weighted): Uses edge gating to filter noisy interactions.

5. Cluster-Aware Analysis

Two clustering strategies were employed to analyze local model performance:

  1. Spectral Clustering: Topology-based partitioning via Normalized Laplacian.
  2. VGAE + Louvain: Functional communities derived from Variational Graph Autoencoder embeddings refined by Louvain optimization.
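The first strategy can be sketched with scikit-learn by treating the weighted adjacency matrix as a precomputed affinity; the number of clusters `k` here is an illustrative assumption:

```python
import networkx as nx
from sklearn.cluster import SpectralClustering

def spectral_partition(G, k=6):
    """Partition nodes via normalized-Laplacian spectral clustering
    on the (weighted) adjacency matrix, used as a precomputed affinity."""
    nodes = list(G.nodes)
    A = nx.to_numpy_array(G, nodelist=nodes)
    labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                                random_state=0).fit_predict(A)
    return dict(zip(nodes, labels))
```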

📊 Key Results

Global Performance Comparison

| Model Category | Method | Best Metric | Interpretation |
| --- | --- | --- | --- |
| Classical | Betweenness Centrality | AUC $\approx$ 0.597 | Global "bridging" is more informative than local degree. |
| Classical | | Precision@100 = 0.26 | Only ~26 of the top 100 ranked proteins are actually essential. |
| GNN | Weighted GatedGCN | AUC $\approx$ 0.802 | State-of-the-art performance by leveraging edge weights. |
| GNN | GCN | High Recall | Performs well globally due to the network's general homophily. |
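Precision@100 in the table is the fraction of the 100 top-ranked proteins that are truly essential; a minimal sketch:

```python
import numpy as np

def precision_at_k(scores, labels, k=100):
    """Fraction of the k top-scored items whose label is 1 (essential)."""
    order = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return float(np.asarray(labels)[order].mean())
```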

Cluster-Specific Findings

Our spectral analysis revealed distinct structural preferences:

  • "Metabolic Factory" Clusters (Sparse):
    • Winner: GIN
    • Reason: Sparse structures require distinguishing graph isomorphisms and counting neighbors, which GIN's sum-aggregator handles best.
  • "Protein Biogenesis" / "Lipid Refinery" (Dense):
    • Winner: GCN
    • Reason: Dense, homophilic regions benefit from GCN's mean-aggregation smoothing effect.

🚀 Usage

Prerequisites

Ensure you have the following libraries installed:

```shell
pip install numpy pandas networkx scikit-learn matplotlib seaborn torch torch-geometric
```

Note: torch-geometric wheels are version-matched to torch, so installing torch first (or following the PyG installation guide) is recommended.

Running the Code

  1. Clone the repository.
  2. Place the dataset files (4932.protein.links.v12.0.csv and pcbi.1008730.s008.xlsx) in the working directory (or update paths in the notebook).
  3. Open codes.ipynb in Jupyter Lab or Google Colab.
  4. Run all cells to reproduce:
    • Data preprocessing and graph construction.
    • Classical centrality calculation and threshold optimization.
    • GNN training (5 architectures).
    • Clustering analysis and visualizations.


Data Sources:

  1. STRING v12.0: https://string-db.org/
  2. Essentiality Labels: Supplementary data from pcbi.1008730.
