From Raw Reads to Cell-Type Insights: snRNA-Seq Bioinformatics Workflow and Deliverables
Single-nucleus RNA sequencing generates raw data that is several layers removed from biological insight. Between the FASTQ files off the sequencer and the cell-type-resolved differential expression results that go into a paper lies a bioinformatics pipeline with snRNA-seq-specific decisions at nearly every stage. Researchers who understand these decisions — rather than accepting default outputs — get substantially more from their data.
This article walks through the end-to-end snRNA-seq bioinformatics workflow: from raw reads through cell-type annotation, differential expression, and pathway enrichment, ending with what a typical data delivery package contains and how to interpret it. The focus throughout is on what differs from conventional scRNA-seq analysis, because treating nuclei data as if it were whole-cell data is the most common source of preventable analytical problems.
Figure 1: End-to-end snRNA-seq bioinformatics workflow from raw sequencing reads through processed data deliverables, highlighting snRNA-seq-specific decisions at the intronic counting and nuclei QC stages.
Starting with FASTQ
The output of an snRNA-seq experiment begins as FASTQ files — millions of short sequencing reads, each tagged with a cell barcode identifying which droplet the read came from and a unique molecular identifier (molecular marker) distinguishing original transcript molecules from PCR duplicates.
The first computational step is alignment and quantification. The three most widely used tools are:
- Cell Ranger (10x Genomics): Demultiplexes reads by sample index, aligns to a reference genome, corrects barcode errors, collapses molecular markers, and produces a feature-barcode count matrix
- STARsolo: Performs alignment and molecular marker counting in a single step — faster than Cell Ranger with comparable accuracy
- kallisto-bustools: Uses pseudoalignment for the fastest processing and lowest memory footprint
Genome reference selection is an early decision with downstream consequences. For human and mouse samples, pre-built 10x Genomics references derived from Ensembl or GENCODE annotations are widely used and well validated. The snRNA-seq-specific choice comes at the annotation level:
| Reference Type | Annotation Scope | Best For |
|---|---|---|
| Standard (exonic-only) | Exons only | Whole-cell scRNA-seq |
| Pre-mRNA (intron-inclusive) | Exons + introns (full gene body) | snRNA-seq — captures nuclear pre-mRNA |
For non-standard organisms, custom references can be built with cellranger mkref or STAR, provided a high-quality genome assembly and gene annotation are available.
A systematic comparison of single-cell and single-nucleus methods confirmed that snRNA-seq libraries contain a substantially higher proportion of intronic reads — a direct consequence of nuclei containing predominantly unspliced pre-mRNA rather than mature cytoplasmic mRNA. The practical implication is straightforward:
- Use the pre-mRNA reference for snRNA-seq, or explicitly enable intron-inclusive counting
- Without this adjustment, a large fraction of reads is discarded, sensitivity drops, and cell-type resolution degrades — particularly for lowly expressed genes and rare populations
Counting Introns and Exons
The intron-versus-exon distinction is the single largest difference between snRNA-seq and scRNA-seq data processing, and it affects the count matrix that all downstream analysis depends on.
In scRNA-seq, mature cytoplasmic mRNA is the dominant species, and reads map predominantly to exons. Cell Ranger's standard exonic-only reference captures this signal efficiently. In snRNA-seq, nuclei are enriched for nascent transcripts — pre-mRNA molecules that still contain introns. When these are aligned to an exonic-only reference, intronic reads are discarded, reducing total molecular marker counts and systematically underestimating expression of genes with long introns or slow splicing kinetics.
The 10x Genomics pre-mRNA reference solves this by marking the entire gene body — exons plus introns — as the counting region. Running cellranger count with the pre-mRNA reference counts any read mapping within a gene's coordinates toward that gene's molecular marker total. The increase in detected genes per nucleus is consistently substantial, often adding hundreds of genes and improving resolution of transcriptional heterogeneity within clusters.
Figure 2: Comparison of exonic-only versus pre-mRNA (intron-inclusive) reference for snRNA-seq quantification, illustrating why the pre-mRNA reference recovers substantially more transcriptional information from nuclear RNA.
For non-10x platforms, the same principle applies with tool-specific configuration:
- STARsolo: Configure with a comprehensive transcriptome annotation spanning full gene bodies (exons + introns)
- kallisto-bustools: Build an intron-inclusive reference index for pseudoalignment
In all cases, the guiding principle is the same: the nuclear transcriptome is predominantly unspliced, and the counting strategy must reflect that biology.
Filtering Out Noise
After quantification, the count matrix contains signal from genuine nuclei, ambient RNA from lysed cells, empty droplets, and occasional doublets. Filtering separates informative nuclei from noise.
Three standard metrics are used for nucleus-level QC:
- Number of molecular markers detected: Low counts suggest empty droplets or debris; very high counts may indicate doublets
- Number of genes detected: Same interpretive logic — too few or too many both warrant scrutiny
- Percentage of reads mapping to mitochondrial genes: Operates differently in snRNA-seq than scRNA-seq (see below)
The mitochondrial filter is one area where snRNA-seq and scRNA-seq diverge sharply:
| Feature | scRNA-seq | snRNA-seq |
|---|---|---|
| Mitochondrial reads expected? | Yes — cytoplasm contains mitochondria | No — cytoplasm is stripped during nuclei isolation |
| Typical % mitochondrial | 5–20% (varies by tissue) | Well under 1% |
| Elevated % interpretation | Possible cell damage or stress | Likely cytoplasmic contamination |
Because the cytoplasm is stripped during nuclei isolation, high-quality snRNA-seq nuclei should contain essentially zero mitochondrial transcripts. Any nucleus with elevated mitochondrial reads likely retains cytoplasmic contamination and should be flagged.
The QClus algorithm provides an alternative filtering approach that models the distribution of droplet-level quality metrics rather than applying fixed thresholds. Using unsupervised clustering of droplet QC profiles, QClus identifies debris-dominated droplets while retaining biologically meaningful variation. Published benchmarks showed that QClus recovered cell populations that fixed-threshold filtering would have excluded, particularly in samples with high debris load.
Doublet detection is the second major filtering step. Computational detectors identify likely doublets by constructing artificial doublet profiles from pairs of real nuclei and flagging nuclei whose transcriptomes resemble these hybrids:
- Scrublet: Simulates doublets from observed data and scores each nucleus by similarity to simulated doublets
- DoubletFinder: Generates artificial nearest-neighbor doublets and uses PCA-space distances for classification
- Seurat's simulated-doublet approach: Integrates doublet scoring within the standard Seurat workflow
A doublet rate of 2–8% is typical, depending on the number of nuclei loaded.
Normalization and Dimensionality Reduction
With clean nuclei in hand, the next steps make data comparable across nuclei and reduce thousands of gene-level measurements to a manageable set of dimensions.
Normalization corrects for differences in sequencing depth across nuclei. The two main methods are:
- SCTransform (recommended for snRNA-seq): A regularized negative binomial regression that models molecular marker counts as a function of sequencing depth and stabilizes variance across the expression range. It simultaneously identifies highly variable genes — the subset that will drive downstream clustering. SCTransform generally produces cleaner separation of clusters in snRNA-seq data because it handles the sparsity and overdispersion of molecular marker count data more effectively than simpler methods.
- LogNormalize: A simpler alternative that log-transforms normalized counts. Widely used but less effective at handling the variable sequencing depth typical of snRNA-seq libraries.
Dimensionality reduction proceeds in two stages:
- Principal component analysis (PCA) on the set of highly variable genes, identifying linear combinations that capture the major axes of variation. Typically 20–50 PCs are retained — enough to capture biological signal without incorporating noise-dominated later components. The choice is guided by an elbow plot, which shows the variance explained by each PC and identifies where additional PCs contribute diminishing returns.
- UMAP or t-SNE embedding of the PCA-reduced data into two dimensions for visualization. UMAP has largely replaced t-SNE because it preserves more global data structure — distances between clusters in a UMAP plot carry more meaning than in t-SNE. However, no embedding fully captures the complexity of single-nucleus data: UMAP plots are useful for exploration and publication figures, but all quantitative analysis should be performed in the full high-dimensional space, not on the two-dimensional embedding.
Clustering and Marker Detection
Clustering groups nuclei with similar transcriptomic profiles, defining the cell populations that all subsequent analysis interrogates.
The standard approach is graph-based clustering using the shared nearest neighbor (SNN) algorithm. The method proceeds through these steps:
- Build a k-nearest neighbor (k-NN) graph in PCA space, connecting each nucleus to its k most similar neighbors
- Weight edges by shared-neighbor overlap — the Jaccard index of neighbor sets
- Partition the graph using the Louvain or Leiden community-detection algorithm (implemented in Seurat as
FindClustersand in Scanpy assc.tl.leiden)
The key parameter is resolution — higher values produce more clusters. A heterogeneous tissue like brain warrants high-resolution clustering to resolve fine neuronal subtypes, while a more homogeneous sample may be adequately described by fewer, broader clusters.
Cluster marker detection identifies the genes that distinguish each cluster. The standard approach — FindAllMarkers in Seurat — runs a Wilcoxon rank-sum test comparing expression in the target cluster against all others. For each cluster, the output is a ranked list of genes with log fold-change values, p-values, and adjusted p-values. These markers serve two purposes:
- Cell-type annotation: Providing molecular evidence for cluster identity. A cluster expressing MBP, PLP1, and MOG is almost certainly oligodendrocytes; one enriched for CD3D, CD3E, and TRAC is T cells
- Functional characterization: Revealing the biological programs active in each cluster. Disease-associated microglia or activated fibroblasts are identified by coordinated enrichment of gene programs, not single markers
Marker-based identification is more nuanced in snRNA-seq because the relationship between nuclear transcript abundance and protein abundance is not straightforward. Transcripts rapidly exported from the nucleus — such as ribosomal and mitochondrial genes — may be underrepresented, while transcripts with long nuclear dwell times may appear enriched relative to their protein-level importance. A canonical surface marker's transcript-level expression in snRNA-seq data is a useful but imperfect indicator of cell identity.
Putting Names to Clusters
Cell-type annotation transforms clusters of transcriptionally similar nuclei into biologically meaningful populations. Three strategies are commonly used, often in combination.
1. Reference-Based Annotation
Reference-based annotation maps each nucleus to the most similar cell type in a reference atlas. Well-annotated references are available from the Human Cell Atlas, Allen Brain Atlas, Tabula Muris, Tabula Sapiens, and the BRAIN Initiative Cell Census Network.
| Tool | Approach | Key Feature |
|---|---|---|
| Seurat label-transfer | Projects query nuclei into a reference-defined PCA space | Integrates with standard Seurat workflow |
| Symphony | Efficient mapping to large references | Scales to millions of reference cells |
| SingleR | Correlation-based scoring against reference profiles | Reference-agnostic; works with any labeled dataset |
These tools are fast and reproducible, but their accuracy depends on the match between query and reference — a cortical reference will misannotate substantia nigra nuclei, and a healthy donor reference may fail to label disease-specific cell states.
2. Marker-Based Annotation
Marker-based annotation relies on known cell-type markers from literature and curated databases such as PanglaoDB and CellMarker. This approach requires more biological domain knowledge but does not depend on a matched reference atlas and can recognize cell types absent from any reference. For tissues or species without well-annotated atlases, marker-based annotation is the primary strategy.
3. Automated Classifiers
Automated tools such as CellTypist and scPred use pre-trained classifiers trained on reference data to assign labels with confidence scores. These are useful for large-scale atlasing projects but share the same reference-matching requirements as reference-based methods.
In practice, robust annotation combines strategies: a first pass with a reference-based tool identifies major lineages, manual review using canonical markers confirms or corrects questionable assignments, and low-confidence labels are flagged for deeper investigation. For disease-state or subcortical brain samples, treating reference-derived labels as hypotheses rather than conclusions is essential.
Differential Expression and Pathway Analysis
With cell types annotated, biological questions are addressed through differential expression and pathway enrichment.
Differential expression (DE) can be performed at the cluster level — comparing the same cell type between conditions — or globally. Cluster-level analysis is the more common and interpretable approach: it asks, for example, whether microglia in Alzheimer's disease donors express different genes than microglia in age-matched controls.
The current best practice is pseudobulk analysis, which aggregates molecular marker counts from all nuclei of a given cell type within each sample into one pseudobulk profile per sample per cell type. These can then be analyzed with tools originally developed for bulk RNA-seq:
| Tool | Approach | Key Strength |
|---|---|---|
| DESeq2 | Negative binomial model | Handles small sample sizes well; widely validated |
| edgeR | Empirical Bayes estimation of negative binomial dispersion | Fast; robust for larger datasets |
| limma-voom | Linear models with precision weights | Flexible design-matrix framework for complex experimental designs |
Pseudobulk analysis avoids the inflated false-positive rates that plague single-cell-resolution DE methods when sample sizes are large, and properly accounts for biological replicates.
Pathway enrichment interprets DE results in terms of biological processes and signaling pathways. After identifying differentially expressed genes, enrichment tools test whether these genes are overrepresented in curated gene sets:
- Gene Ontology (GO) terms: Biological processes, molecular functions, cellular components
- KEGG pathways: Metabolic and signaling pathways
- Reactome pathways: Curated biological pathway database
- MSigDB: Molecular Signatures Database — includes hallmark gene sets, oncogenic signatures, and immunologic signatures
Widely used tools include clusterProfiler (supports over-representation analysis and GSEA with publication-quality visualization) and Enrichr (comprehensive gene-set library collection with a web interface).
For snRNA-seq specifically, pathway results should be interpreted with awareness of nuclear transcriptome composition. Genes depleted in nuclear fractions — mitochondrial, ribosomal, and rapidly exported transcripts — will show lower enrichment scores even if their protein products are biologically active. Conversely, genes with long introns or regulated nuclear retention may show inflated enrichment. These biases mean pathway-level conclusions are most robust when the gene sets of interest are well represented in the nuclear transcriptome and when orthogonal protein-level validation is planned.
What You Receive: Deliverables and Report Interpretation
A typical snRNA-seq data analysis delivery includes several categories of output, summarized below.
| Deliverable Level | Contents | Format | Purpose |
|---|---|---|---|
| Raw data | FASTQ files, Cell Ranger (or equivalent) pipeline output, filtered feature-barcode count matrix | FASTQ, HDF5 (.h5), or Market Exchange Format (.mtx) | Starting point for custom downstream analysis in Seurat, Scanpy, or other frameworks |
| QC reports | Web summary HTML (reads, mapping rates, nuclei recovered, median genes per nucleus, sequencing saturation) + nuclei-level QC report (filtering thresholds, nuclei retained/removed at each step, distributions of key metrics) | HTML | Quality assessment and filtering documentation |
| Processed data object | Normalized, dimensionality-reduced, clustered, and annotated data | Seurat object (.rds) or AnnData object (.h5ad) | Users can regenerate UMAP plots, examine gene expression, subset nuclei, and perform additional analyses without re-running preprocessing |
| Analysis report | UMAP plots colored by cluster, cell type, and condition; heatmaps and dot plots of marker gene expression; cluster marker tables with statistics; cell-type proportions per sample; differential expression results with volcano plots and ranked gene lists | HTML or PDF | Tells a coherent story: cell types detected, defining markers, proportion shifts across conditions, transcriptional programs activated or suppressed in disease |
| Visualization outputs | Publication-ready figures + interactive HTML files for cluster exploration + raw data tables underlying each figure | PNG, PDF, interactive HTML, tabular data | Publication and collaborative exploration — non-bioinformatician team members can explore results without writing code |
| Advanced analyses (optional) | Enrichment analysis results (GO terms, KEGG pathways, MSigDB gene sets per cluster or per DE comparison) + cell-cell communication analysis (CellChat, NicheNet) | Tables, figures | Deeper biological interpretation |
Figure 3: Representative snRNA-seq data analysis deliverables, including UMAP visualization, marker gene expression heatmap, differential expression results, and pathway enrichment outputs.
For researchers who want to go further, deliveries often include enrichment analysis results — tables of enriched GO terms, KEGG pathways, or MSigDB gene sets per cluster or per DE comparison — and cell-cell communication analysis using tools such as CellChat or NicheNet.
Researchers planning snRNA-seq projects can explore CD Genomics snRNA sequencing services for experimental design through data analysis support. For studies requiring spatial context, the spatial transcriptomics services provide complementary tissue-level readouts.
Frequently Asked Questions
Q: Do I really need to use the pre-mRNA reference for snRNA-seq, or is the standard exonic reference acceptable?
A: The standard exonic reference is not recommended for snRNA-seq. Because nuclei contain predominantly unspliced pre-mRNA, intronic reads represent a substantial fraction of the library. When aligned to an exonic-only reference, these reads are discarded, reducing sensitivity across the transcriptome and systematically underestimating expression of genes with long introns or slow splicing kinetics. The practical consequence is fewer detected genes per nucleus, degraded cluster resolution, and potential loss of rare cell populations. Using the pre-mRNA reference — or enabling intron-inclusive counting in STARsolo or kallisto-bustools — recovers this signal with no downside beyond a modest increase in computational time.
Q: How many nuclei should remain after QC filtering?
A: There is no universal threshold, because the appropriate number depends on tissue type, nuclei isolation efficiency, and sequencing depth. In practice, retaining 60–90% of initially detected nuclei is typical for well-prepared samples — fresh-frozen brain or liver tissue with gentle lysis conditions often falls at the high end of this range, while fibrous or lipid-rich tissues may retain a lower fraction. More important than the absolute number is examining the distribution of QC metrics: a clean sample shows a distinct population of high-quality nuclei separated from a debris cloud in molecular marker-versus-gene-count space, rather than a single continuum. If the retained fraction drops below 50%, reviewing the isolation protocol and filtration thresholds is warranted.
Q: What clustering resolution should I use, and how do I know if it is appropriate?
A: Resolution is dataset-dependent — there is no single correct value. A heterogeneous tissue such as brain typically benefits from resolutions in the 0.8–2.0 range to resolve fine neuronal subtypes, while a more homogeneous tissue may be adequately described at 0.2–0.6. The most reliable way to evaluate resolution is to examine differential marker expression between neighboring clusters: if two clusters are distinguished only by mitochondrial or ribosomal gene expression, or by a single gene of uncertain biological relevance, the resolution is likely too high. Conversely, if a cluster known to contain multiple cell types (for example, a "neuronal" cluster mixing excitatory and inhibitory markers) fails to split at the chosen resolution, the resolution is too low. Iterative testing across a range of resolutions and biological review of the resulting partitions is more informative than any automated metric.
Q: Can I use scRNA-seq reference atlases for annotating snRNA-seq data?
A: Yes, with caveats. Reference-based tools such as Seurat label-transfer, Symphony, and SingleR can map snRNA-seq nuclei to scRNA-seq-derived references, and for major cell lineages the results are often concordant. However, two limitations deserve attention. First, gene expression differences between nuclear and whole-cell transcriptomes mean that some marker genes will have systematically different expression levels across data types — a canonical surface marker that works well in scRNA-seq may appear weakly expressed or absent in snRNA-seq, and vice versa. Second, cell states defined by cytoplasmic processes (secretory activity, metabolic flux, stress responses) may be captured differently or not at all in nuclear data. For these reasons, reference-derived labels are best treated as starting hypotheses to be confirmed with marker-based review, particularly for subpopulations, disease states, or tissues not well represented in the reference.
Q: Why use pseudobulk differential expression instead of testing at the single-nucleus level?
A: Single-nucleus-level differential expression tests treat each nucleus as an independent replicate, but nuclei from the same sample are not independent — they share donor genetics, sample handling, and technical batch effects. This pseudoreplication inflates the effective sample size and produces unacceptably high false-positive rates, sometimes exceeding 50% in datasets with hundreds of nuclei per sample. Pseudobulk analysis avoids this problem by aggregating counts across all nuclei of a given cell type within each biological replicate, producing one observation per sample per cell type. These aggregated profiles can then be analyzed with standard bulk RNA-seq tools (DESeq2, edgeR, limma-voom) that properly model biological variation. The result is more conservative but far more reliable: differentially expressed genes identified by pseudobulk analysis replicate across independent cohorts at substantially higher rates than those from single-cell-level tests.
References
- Ding J, Adiconis X, Simmons SK, et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nature Biotechnology. 2020;38:737-746. doi:10.1038/s41587-020-0465-8
- 10x Genomics. Interpreting Single Cell Gene Expression Data With and Without Intronic Reads. Technical Note CG000554, Rev A. 2022. 10x Genomics Support
- Amezquita RA, Lun ATL, Becht E, et al. Orchestrating single-cell analysis with Bioconductor. Nature Methods. 2020;17:137-145. doi:10.1038/s41592-019-0654-x
- Schmauch E, Ojanen J, Galani K, et al. QClus: a droplet filtering algorithm for enhanced snRNA-seq data quality in challenging samples. Nucleic Acids Research. 2024;53(1):gkae1145. doi:10.1093/nar/gkae1145
- Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biology. 2019;20:296. doi:10.1186/s13059-019-1874-1
- Hao Y, Stuart T, Kowalski MH, et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nature Biotechnology. 2024;42:293-304. doi:10.1038/s41587-023-01767-y
- Aran D, Looney AP, Liu L, et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nature Immunology. 2019;20:163-172. doi:10.1038/s41590-018-0276-y
- Wu T, Hu E, Xu S, et al. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021;2(3):100141. doi:10.1016/j.xinn.2021.100141
- McGinnis CS, Murrow LM, Gartner ZJ. DoubletFinder: Doublet detection in single-cell RNA sequencing data using artificial nearest neighbors. Cell Systems. 2019;8(4):329-337. doi:10.1016/j.cels.2019.03.003
- Wolock SL, Lopez R, Klein AM. Scrublet: Computational identification of cell doublets in single-cell transcriptomic data. Cell Systems. 2019;8(4):281-291. doi:10.1016/j.cels.2018.11.005
- 10x Genomics. Best Practices for Analysis of Single Cell RNA-seq Data. Analysis Guides. 2024. 10x Genomics Analysis Guides