A comparative genomics pipeline for identifying and characterising tissue-specific and stress-responsive long non-coding RNAs (lncRNAs) in conifers using minimap2, bedtools, and GO/KEGG enrichment analysis.
Developed as part of an MSc thesis at Umeå University, the pipeline was initially applied to Pinus sylvestris under cold and drought stress conditions across needle and root tissues. It is being extended to Picea abies (Norway spruce), including stress response and embryogenesis samples.
Contact Information:
- Email: kvs.ms.2512@gmail.com
- GitHub: KvS-25
Table of Contents:
- Pipeline Overview
- Requirements
- Installation
- Usage
- Automated Workflow (Snakemake)
- Output Structure
- Multi-species Analysis
- Citation
- License
Requirements:
- Micromamba or Conda
- SLURM workload manager (for alignment step)
- Internet access (for KEGG pathway name download)
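
As a quick sanity check before installing, the command-line requirements above can be verified from a shell. This is a minimal sketch and not part of the repository: `check_tool` is a hypothetical helper that only tests whether a command is on `PATH`, and internet access is not checked.

```bash
# Hypothetical helper (not part of the repository): report whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Either micromamba or conda is sufficient.
check_tool micromamba || check_tool conda || echo "install Micromamba or Conda first"

# sbatch is only needed for the SLURM alignment step.
check_tool sbatch || echo "no SLURM found: the alignment step cannot be submitted from this machine"
```

Run it on the machine you intend to submit jobs from, since a login node and a laptop will usually give different answers.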
Installation:

1. Clone the repository:

   ```bash
   git clone https://github.com/KvS-25/comparative-lncRNA-pipeline.git
   cd comparative-lncRNA-pipeline
   ```

2. Install Snakemake and the SLURM executor plugin:

   ```bash
   micromamba create -n snakemake -c conda-forge -c bioconda snakemake
   micromamba activate snakemake
   pip install snakemake-executor-plugin-slurm
   ```

3. Ensure conda ≥ 24.7.1 is available in the snakemake environment:

   ```bash
   micromamba install "conda>=24.7.1" -c conda-forge
   ```

4. Create conda environments:

   ```bash
   micromamba env create -f envs/alignment.yaml
   micromamba env create -f envs/goanalysis.yaml
   ```

5. Set up config:

   ```bash
   cp config/config.yaml.template config/config.yaml
   nano config/config.yaml  # fill in your paths
   ```

Usage:

Run scripts in order:

```bash
# Step 1: Align (SLURM)
sbatch scripts/01_align.sh
# Step 2: Multi-sample comparison (login node)
bash scripts/02_multiinter.sh
# Step 3: GO enrichment (login node)
micromamba activate goanalysis
Rscript scripts/03_go_analysis.R
# Step 4: KEGG enrichment (login node)
Rscript scripts/04_kegg_analysis.R
# Step 5: Generate plots (login node)
Rscript scripts/05_plots.R
```

Automated Workflow (Snakemake):

```bash
micromamba activate snakemake
# Dry run first (always do this)
snakemake --config species=pine --dry-run --cores 4
# Run on SLURM cluster: pine
snakemake --config species=pine --profile profiles/slurm --use-conda
# Run on SLURM cluster: spruce
snakemake --config species=spruce --profile profiles/slurm --use-conda
# Run locally
snakemake --config species=pine --cores 4 --use-conda
```

If jobs fail and leave a lock:

```bash
snakemake --config species=pine --profile profiles/slurm --use-conda --unlock
snakemake --config species=pine --profile profiles/slurm --use-conda --rerun-incomplete
```

Output Structure:

```
results/
├── paf/                      # minimap2 alignment output
├── bed/                      # converted BED files
├── fasta/                    # extracted FASTA sequences
├── GO_analysis/              # GO enrichment results
│   ├── gene_to_GO.txt
│   ├── mstrg_to_refgene.txt
│   ├── *_refgenes.txt
│   └── *_GO_enrichment.txt
├── KEGG/                     # KEGG pathway results
│   ├── gene_to_KEGG.txt
│   ├── kegg_pathway_names.txt
│   └── *_KEGG_enrichment.txt
├── plots/                    # all figures
│   ├── upset_plot.png
│   ├── region_counts_bar.png
│   ├── GO_bar_*.png
│   └── KEGG_bubble_*.png
├── multiinter_output.bed
├── conserved.bed
├── needle_specific.bed
├── root_specific.bed
├── cold_specific.bed
└── drought_specific.bed
```
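
A quick way to sanity-check the category BED files after a run is to count the regions and total covered bases in each. The sketch below is illustrative only and not part of the repository: `summarise_bed` is a hypothetical helper, and it assumes standard BED columns (chrom, start, end) with half-open coordinates and the default `results/` directory.

```bash
# Hypothetical helper: print "<file>  <regions>  <total bp>" (tab-separated) for each BED file.
summarise_bed() {
    for f in "$@"; do
        awk -v name="$(basename "$f")" 'BEGIN { OFS = "\t" }
            # Skip comment, track and browser lines; count every line with at least 3 fields.
            !/^(#|track|browser)/ && NF >= 3 { n++; bp += $3 - $2 }
            END { print name, n + 0, bp + 0 }' "$f"
    done
}

# Summarise whichever of the category files exist.
for bed in results/conserved.bed results/needle_specific.bed results/root_specific.bed \
           results/cold_specific.bed results/drought_specific.bed; do
    if [ -f "$bed" ]; then summarise_bed "$bed"; fi
done
```

An empty or missing category file is worth investigating before moving on to the enrichment steps.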
Multi-species Analysis:

The pipeline is designed to be species-agnostic. It has been applied to Pinus sylvestris and is being extended to Picea abies (Norway spruce) for both stress response and embryogenesis comparisons.

To apply it to a new species:

- Obtain candidate lncRNA FASTAs using Plant LncRNA Pipeline v2
- Obtain a reference transcriptome and eggNOG-mapper annotation for your species
- Copy and update the config:

  ```bash
  cp config/config.yaml.template config/config.yaml
  # Update species, genome paths, sample names and output directory
  ```

- Run as normal; the pipeline requires no other changes

Sample names follow a three-letter code:

| Code | Meaning |
|---|---|
| PCN | Pine Cold Needle |
| PCR | Pine Cold Root |
| PDN | Pine Drought Needle |
| PDR | Pine Drought Root |
| SCN | Spruce Cold Needle |
| SCR | Spruce Cold Root |
| SDN | Spruce Drought Needle |
| SDR | Spruce Drought Root |
| SZE | Spruce Zygotic Embryo |
| SSE | Spruce Somatic Embryo |
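
The codes in the table can be read position by position: species, then condition (or embryo type), then tissue. The following sketch spells that out; `decode_sample` is a hypothetical helper written for illustration and is not part of the repository.

```bash
# Hypothetical helper (not part of the repository): expand a three-letter sample code.
# Position 1 = species, position 2 = condition or embryo type, position 3 = tissue.
decode_sample() {
    code=$1
    case $code in
        P??) species=Pine ;;
        S??) species=Spruce ;;
        *)   echo "unknown species in '$code'" >&2; return 1 ;;
    esac
    case $code in
        ?C?) condition=Cold ;;
        ?D?) condition=Drought ;;
        ?Z?) condition=Zygotic ;;
        ?S?) condition=Somatic ;;
        *)   echo "unknown condition in '$code'" >&2; return 1 ;;
    esac
    case $code in
        ??N) tissue=Needle ;;
        ??R) tissue=Root ;;
        ??E) tissue=Embryo ;;
        *)   echo "unknown tissue in '$code'" >&2; return 1 ;;
    esac
    echo "$species $condition $tissue"
}

decode_sample PCN
decode_sample SZE
```

New designs that keep this positional scheme stay readable in plots and tables; anything else needs its own mapping.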
Note for embryogenesis or other experimental designs: the awk filters in the multiinter step are generated automatically from the sample name conventions. For other designs, update the filter logic accordingly. See `docs/usage.md` for details.
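
To make the filter logic concrete, the sketch below shows how category files could be derived from `bedtools multiinter` output with awk. It is an illustration under assumptions, not the pipeline's actual filters: it assumes the four pine samples were passed with `-names PCN PCR PDN PDR`, so that column 4 holds the number of samples covering an interval and column 5 the comma-separated sample names. The intervals are made-up toy data, and "specific" is taken here to mean that every sample covering the interval belongs to that tissue or condition.

```bash
# Toy multiinter output (made-up intervals):
# chrom, start, end, n_samples, sample list, then one 0/1 flag per sample
multiinter=$(mktemp)
printf 'chr1\t100\t200\t4\tPCN,PCR,PDN,PDR\t1\t1\t1\t1\n'  > "$multiinter"
printf 'chr1\t300\t400\t2\tPCN,PDN\t1\t0\t1\t0\n'          >> "$multiinter"
printf 'chr1\t500\t600\t2\tPCN,PCR\t1\t1\t0\t0\n'          >> "$multiinter"
printf 'chr2\t50\t150\t1\tPDR\t0\t0\t0\t1\n'               >> "$multiinter"

# Conserved: covered by all four samples.
conserved=$(awk -F'\t' '$4 == 4' "$multiinter")

# Needle-specific: no code in the list ends in R (root).
# A trailing comma is appended so the last code is tested like the others.
needle_specific=$(awk -F'\t' '($5 ",") !~ /R,/' "$multiinter")

# Cold-specific: no code in the list has D (drought) as its middle letter.
# A leading comma is prepended so the first code is tested like the others.
cold_specific=$(awk -F'\t' '("," $5) !~ /,.D/' "$multiinter")

printf 'conserved:\n%s\nneedle-specific:\n%s\ncold-specific:\n%s\n' \
    "$conserved" "$needle_specific" "$cold_specific"
rm -f "$multiinter"
```

With the toy data, the conserved filter keeps the first interval, the needle-specific filter the second, and the cold-specific filter the third. For an embryogenesis design (SZE, SSE), the patterns would need to match on the second letter instead of the tissue letter.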
To compare pine and spruce results, run the pipeline separately for each species with separate output directories. GO and KEGG enrichment results can be compared directly between species. For a combined multi-sample analysis:

```bash
# The output directory must exist before redirecting into it.
mkdir -p results_combined
# -names must be given in the order the globs expand (alphabetical by file name).
bedtools multiinter \
  -i results_pine/bed/*.bed results_spruce/bed/*.bed \
  -names PCN PCR PDN PDR SCN SCR SDN SDR \
  > results_combined/multiinter_output.bed
```

Citation:

Please see CITATIONS.md for full citation information.

License:

MIT License: free to use and modify with attribution.