Database download and setup guide for sRNA-IMP
This bundle documents and automates setup of the major databases used by sRNA-IMP.
Covered resources
Human arm
- Human miRNA reference
- Human tRNA reference
- Human rRNA reference
- Human other ncRNA reference
Non-human arm
- RiboDetector model setup (tool installation, no large external DB required)
- Kraken2 standard taxonomy database
- Bracken k-mer distribution files
- Rfam covariance models
- Optional non-human miRNA references
- Optional combined tRNA references
Novel ncRNA discovery
- Rfam covariance models
- ARAGORN / tRNAscan-SE installation via conda environment
Recommended top-level directory layout
databases/
├── human/
│   ├── mirna/
│   ├── trna/
│   ├── rrna/
│   └── otherrna/
├── nonhuman/
│   ├── kraken2/
│   ├── bracken/
│   ├── mirna/
│   ├── trna/
│   └── rfam/
└── shared/
    └── rfam/
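The layout above can be created in one pass. This is a minimal sketch; `DB_ROOT` is a variable name chosen here for illustration and defaults to `databases` in the current directory.

```shell
# DB_ROOT is an illustrative name, not something the pipeline is known to read.
DB_ROOT="${DB_ROOT:-databases}"

# One entry per leaf directory in the recommended layout.
for sub in \
    human/mirna human/trna human/rrna human/otherrna \
    nonhuman/kraken2 nonhuman/bracken nonhuman/mirna nonhuman/trna nonhuman/rfam \
    shared/rfam
do
    mkdir -p "$DB_ROOT/$sub"
done
```

Because `mkdir -p` is idempotent, the snippet can be re-run safely on an existing tree.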
Quick-start
- Create the conda / micromamba environment that contains the required tools.
- Edit config_examples/database_paths.example.sh
- Run:
bash scripts/download_and_setup_databases.sh
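For the first quick-start step, one possible environment can be sketched as follows. The environment name `srna-imp-db` and the package selection are assumptions made here for illustration; if the repository ships its own environment file, that should take precedence.

```shell
# Hypothetical helper: create a micromamba environment with the tools named in
# this guide. The default environment name is an assumption.
create_db_env() {
    env_name="${1:-srna-imp-db}"
    micromamba create -y -n "$env_name" -c conda-forge -c bioconda \
        infernal kraken2 bracken bowtie trnascan-se aragorn ribodetector
}
```

Calling `create_db_env` with no argument uses the default name; pass a different name as the first argument to override it.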
Notes on data sources
Rfam
The official Rfam genome-annotation guide recommends downloading Rfam.cm.gz and Rfam.clanin, then indexing Rfam.cm with cmpress. Use cmscan --rfam --cut_ga for annotation against these models.
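The download-and-index step described above might look like the sketch below. The function name `setup_rfam` and the `RFAM_URL` default (the `CURRENT` directory of the Rfam FTP mirror at EBI) are assumptions; pin a specific release directory if you need reproducible builds.

```shell
# RFAM_URL is an assumed default; check the Rfam FTP site for the release to pin.
RFAM_URL="${RFAM_URL:-https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT}"

# Hypothetical helper: fetch Rfam.cm.gz and Rfam.clanin into the given
# directory, decompress the models, and index them with cmpress.
# cmpress -F overwrites index files left over from a previous run.
setup_rfam() {
    rfam_dir="$1"
    mkdir -p "$rfam_dir" || return 1
    (
        cd "$rfam_dir" &&
        wget -q "$RFAM_URL/Rfam.cm.gz" &&
        wget -q "$RFAM_URL/Rfam.clanin" &&
        gunzip -f Rfam.cm.gz &&
        cmpress -F Rfam.cm
    )
}

# Example annotation call afterwards (contigs.fa is a placeholder input):
#   cmscan --rfam --cut_ga --nohmmonly --fmt 2 --clanin Rfam.clanin \
#          --tblout hits.tbl Rfam.cm contigs.fa > hits.cmscan
```

A typical call would be `setup_rfam databases/shared/rfam`. The steps are chained with `&&` so that a failed download stops the function before `cmpress` runs on a missing file.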
Kraken2
The official Kraken2 manual documents kraken2-build --standard --db <DB> for building the standard database.
Bracken
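Bracken requires a built Kraken2 database plus one k-mer distribution file per read length, named database<READ_LEN>mers.kmer_distrib (for example database50mers.kmer_distrib), which is generated with bracken-build.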
tRNAscan-SE
tRNAscan-SE 2.0 is available from the UCSC Lowe Lab GitHub and website. It depends on Infernal.
ARAGORN
ARAGORN is available through Bioconda, which is the simplest reproducible installation route for pipeline use.
RiboDetector
RiboDetector is distributed via Bioconda and GitHub. It is a tool rather than a large downloadable reference database.
miRNA references
For miRNA references, use either:
- miRBase (mature.fa, hairpin.fa) when you want broad species coverage
- MirGeneDB when you want a more conservative, curated animal miRNA set
Human arm reference recommendations
Human miRNA
Use one of:
- miRBase mature.fa filtered to hsa-*
- MirGeneDB mature sequences for human
Build a Bowtie index:
bowtie-build hsa_mature.fa hsa_mature
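The `hsa_mature.fa` input is not shipped by miRBase as such; it has to be filtered out of `mature.fa`. miRBase sequences are written as RNA (with U), so a U-to-T conversion is usually needed before indexing against sequencing reads. A minimal sketch, assuming miRBase-style headers such as `>hsa-miR-...`; the function name is hypothetical.

```shell
# Hypothetical helper: read a miRBase FASTA on stdin, keep only records whose
# header starts with ">PREFIX-", and convert U to T on sequence lines only.
filter_mirbase_species() {
    prefix="$1"   # three-letter miRBase species code, e.g. hsa
    awk -v p=">$prefix-" '
        /^>/ { keep = (index($0, p) == 1) }
        keep {
            if ($0 !~ /^>/) gsub(/U/, "T")   # leave header text untouched
            print
        }
    '
}

# Usage:
#   filter_mirbase_species hsa < mature.fa > hsa_mature.fa
```

Matching on the full `>hsa-` prefix, rather than on `hsa` anywhere in the header, avoids picking up records from other species whose description happens to contain the same letters.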
Human tRNA
Preferred sources:
- GtRNAdb exports for human nuclear tRNAs
- human mitochondrial tRNAs from a curated FASTA
Build:
cat human_tRNA.fa human_mt_tRNA.fa > hsa_tRNA_all.fa
bowtie-build hsa_tRNA_all.fa hsa_tRNA_all
Human rRNA
Use curated human cytosolic + mitochondrial rRNA FASTA, then build:
bowtie-build hsa_rRNA.fa hsa_rRNA
Human other ncRNA
A practical source is Ensembl ncRNA FASTA:
bowtie-build Homo_sapiens.GRCh38.ncrna.fa hsa_other_ncRNA
Non-human arm reference recommendations
Kraken2 + Bracken
- Build Kraken2 standard DB
- Build Bracken files for your read length (50 bp is typical for this pipeline)
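The two build steps can be wrapped as shown below. This is a sketch only: the function name, the default thread count, and the example path are assumptions. The k-mer length of 35 is the Kraken2 default and should only be changed if the database was built with a different value.

```shell
# Hypothetical wrapper: build the Kraken2 standard database, then the Bracken
# k-mer distribution for one read length.
# -k 35 matches the default Kraken2 k-mer length.
build_kraken_bracken() {
    kraken_db="$1"          # e.g. databases/nonhuman/kraken2
    read_len="${2:-50}"     # should match the trimmed read length
    threads="${3:-8}"

    kraken2-build --standard --threads "$threads" --db "$kraken_db" &&
    bracken-build -d "$kraken_db" -t "$threads" -k 35 -l "$read_len"
}

# Usage:
#   build_kraken_bracken databases/nonhuman/kraken2 50 16
```

Note that bracken-build writes its output into the Kraken2 database directory itself, so the `.kmer_distrib` file appears next to the `.k2d` files. Re-run only the `bracken-build` step to add a second read length to an existing database.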
Rfam
Use the full Rfam.cm, then optionally extract a smaller subset later if runtime becomes an issue.
Optional non-human miRNA reference
Use miRBase mature.fa, exclude hsa, and optionally subset to:
- plants
- fungi
- protozoa
- viruses
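Subsetting by group comes down to choosing which miRBase species codes to keep. The sketch below assumes you supply that list yourself in a plain text file, one code per line (for example `ath` and `osa` for two plant species); the function name and file name are hypothetical, and the mapping from taxonomic group to species codes is not provided here.

```shell
# Hypothetical helper: read a miRBase FASTA on stdin and keep only records
# whose species code appears in codes_file. hsa is dropped even if listed.
subset_mirbase_by_codes() {
    codes_file="$1"
    awk -v codes="$codes_file" '
        BEGIN {
            while ((getline c < codes) > 0)
                if (c != "" && c != "hsa") want[c] = 1
            close(codes)
        }
        /^>/ {
            code = substr($1, 2)     # strip the leading ">"
            sub(/-.*/, "", code)     # keep the text before the first "-"
            keep = (code in want)
        }
        keep
    '
}

# Usage:
#   subset_mirbase_by_codes nonhuman_codes.txt < mature.fa > nonhuman_mature.fa
```

Keeping the code list in a separate file makes the chosen subset easy to record alongside the other provenance information.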
Optional tRNA reference
For broader non-human tRNA screening, create a combined FASTA containing:
- bacterial tRNAs
- fungal / eukaryotic tRNAs
- plant chloroplast tRNAs
- plant mitochondrial tRNAs
Runtime / storage expectations
| Resource | Approximate size | Comment |
|---|---|---|
| Kraken2 standard DB | ~100 GB during build | largest resource |
| Rfam.cm + indices | several GB | shared across workflows |
| Bracken files | depends on Kraken DB | build once per DB |
| Bowtie miRNA/tRNA/rRNA refs | MB to low GB | lightweight |
Verification checklist
After setup, verify:
- cmpress generated Rfam.cm.i1f, Rfam.cm.i1i, Rfam.cm.i1m, and Rfam.cm.i1p
- Kraken2 DB contains hash.k2d, opts.k2d, taxo.k2d
- Bracken DB contains database50mers.kmer_distrib (or matching read length)
- each Bowtie index has .1.ebwt through .4.ebwt files, plus .rev.1.ebwt and .rev.2.ebwt
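The checklist can be scripted with a small helper that reports every missing or empty file. This is a sketch; the function name is hypothetical and the paths in the usage comment assume the recommended directory layout.

```shell
# Hypothetical helper: print each expected file that is missing or empty and
# return non-zero if at least one is.
check_files() {
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "MISSING: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (paths are assumptions based on the recommended layout):
#   check_files \
#       databases/shared/rfam/Rfam.cm.i1f databases/shared/rfam/Rfam.cm.i1i \
#       databases/shared/rfam/Rfam.cm.i1m databases/shared/rfam/Rfam.cm.i1p \
#       databases/nonhuman/kraken2/hash.k2d databases/nonhuman/kraken2/opts.k2d \
#       databases/nonhuman/kraken2/taxo.k2d \
#       databases/nonhuman/kraken2/database50mers.kmer_distrib
```

The helper checks all files before returning, so a single run lists everything that still needs attention.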
Suggested provenance file
Create a plain text manifest such as:
Rfam=15.1
Kraken2_standard_built=2026-03-27
Bracken_read_length=50
miRBase=22.1
MirGeneDB_downloaded=2026-03-27
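A manifest in this format can be written by a short helper. The sketch below is illustrative: the function name, the output path in the usage comment, and the extra `manifest_written` key are assumptions. All version values are supplied by the caller; nothing is queried from the tools.

```shell
# Hypothetical helper: write caller-supplied key=value pairs to a manifest
# file and append the date the manifest was written.
write_manifest() {
    out="$1"; shift
    : > "$out" || return 1
    for kv in "$@"; do
        case "$kv" in
            *=*) printf '%s\n' "$kv" >> "$out" ;;
            *)   echo "malformed entry (expected key=value): $kv" >&2
                 return 1 ;;
        esac
    done
    printf 'manifest_written=%s\n' "$(date +%Y-%m-%d)" >> "$out"
}

# Usage:
#   write_manifest databases/MANIFEST.txt \
#       "Rfam=15.1" "Bracken_read_length=50" "miRBase=22.1"
```

Rejecting entries without an `=` keeps the file easy to parse later with standard tools.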