Databases

sRNA-IMP expects the database layer to be prepared ahead of pipeline execution. The pipeline config should point to the prepared reference files or indexes, but the documentation here explains the structure and build logic without relying on internal absolute paths.

Overview

The database setup used for this project is organized into the following groups: - human/ - nonhuman/miRNAs/ - nonhuman/tRNA/ - rfam/ - silva/ - mirbase/

Human References

The human setup combines miRBase-derived miRNA references with Ensembl ncRNA references.

miRNA references

The mirbase/ setup downloads: - hairpin.fa - mature.fa

Human entries are extracted into: - hsa_hairpin_miRNA.fa - hsa_mature_miRNA.fa

and Bowtie indexes are built for both.

Ensembl ncRNA references

The human README shows the use of the Ensembl human ncRNA FASTA to split records by gene_biotype, including: - lncRNA - miRNA - misc_RNA - Mt_rRNA - Mt_tRNA - ribozyme - rRNA - scaRNA - snoRNA - snRNA - sRNA - vault_RNA

From these subtype FASTAs, the setup builds consolidated references used by the pipeline: - hsa_tRNA_all - hsa_rRNA_all - hsa_otherRNA - hsa_mature_miRNA

These are the main human indexes expected by the host-specific workflow.

Non-human References

Non-human miRNA-like references

The nonhuman/miRNAs/ setup contains: - nonhuman_miRNA.fa - the Bowtie index mirna_euk_all

This is used in the non-host branch for miRNA-like sequence assignment.

Non-human tRNA references

The nonhuman/tRNA/ setup contains multiple source FASTA files from different origins, such as: - archaeal - bacterial - fungal - plant - viral - phage - plasmid - environmental - SRA-derived - chloroplast-related

These are merged into: - trna_all.fa - Bowtie index trna_all

This merged index is the main non-host tRNA reference used by the pipeline.

Rfam Setup

The rfam/ setup is more elaborate and supports both class-level lightweight classification and covariance-model-based ncRNA discovery support.

Main downloaded resources

The README indicates use of: - Rfam.fa.gz - Rfam.cm.gz - family.txt.gz - rfamseq.txt - full_region.txt.gz

Class-level Rfam reference

The class-level Rfam setup normalizes family annotations into practical groups such as: - rRNA - tRNA - miRNA - snRNA - snoRNA - antisense - ribozyme - leader - CRISPR - riboswitch - other

The documented build flow is: 1. derive family-to-class mappings from family.txt.gz 2. normalize those classes into a cleaned table 3. build a combined class FASTA with build_rfam_class_fasta.py 4. extract a reduced accession-to-taxid table for downstream summarization

Important derived files include: - rfam_classes.fa - rfam_classes.accessions.txt - rfamseq.reduced.tsv - rfamseq.reduced.named.tsv

The pipeline uses the Bowtie index built from rfam_classes.fa for fast non-host Rfam class assignment.

Full covariance models

For novel ncRNA discovery, the setup also uses the complete Rfam.cm covariance model database. The README notes compression of the covariance model file with: - cmpress

This full CM file is appropriate for cmscan-based candidate filtering.

Rfam Taxonomy Repository

The database tree also includes the rfam-taxonomy/ repository, which provides domain-specific Rfam family subsets and clan files.

Important files there include: - rfam-taxonomy.py - scripts/rfam_db.py - domains/all-domains.csv - domains/bacteria.csv - domains/eukaryota.csv - domains/viruses.csv - matching .clanin files per domain

The bundled README explains that these outputs can be used to: - create domain-specific CM subsets with cmfetch - run cmscan with matching --clanin files - support stricter domain-focused annotation strategies

This is especially useful when you want domain-specific Rfam covariance model subsets rather than the full Rfam.cm file.

SILVA

The silva/ setup includes download of SILVA SSU reference FASTA data. This supports broader rRNA-oriented reference work, even though the current native pipeline primarily uses ribodetector_cpu plus Kraken2 for the non-host rRNA branch.

Helper Scripts Used During Database Preparation and Analysis

The setup around these databases uses several local Python helper scripts that are worth documenting explicitly: - build_rfam_class_fasta.py - taxid_to_name.py - collapse_fastq_to_counted_fasta.py - sum_collapsed_bowtie_hits.py - sum_collapsed_trna_by_taxon.py - sum_collapsed_rfam_by_class_and_taxon.py - sam_fractional_counts.py - summarize_human_subtree.py

These scripts are part of the broader project tooling and support both reference preparation and downstream summarization.

Tools Mentioned in the Database Build Notes

The README files and companion repository indicate use of tools such as: - wget - gunzip / zcat - grep, cut, sort, uniq, awk, sed - seqkit - bowtie-build - cmpress - cmfetch - cmscan - curl - Python helper scripts

Practical Recommendation for Public Documentation

For public docs, it is best to describe databases by logical role and filename pattern rather than by internal infrastructure paths. A good pattern is to document: - the source database or release family - the derived FASTA or CM artifact names - the indexing tool used - which workflow consumes the result

That keeps the setup reproducible without exposing site-specific storage layout.