Single-cell RNA Sequencing: Quality Control

Quick Overview

01 Key Considerations for Cell Separation in Single-Cell Sequencing
02 Cell Sorting
03 Cell Lysis
04 Reverse Transcription
05 Amplification Process
06 Library Preparation and Sequencing
07 Quality Control Checkpoints of Single-cell RNA-Seq
08 How to Filter Cells
09 How to Filter Genes
10 How to Address Batch Effects

Single-cell RNA sequencing (scRNA-seq) has become a prominent research tool. It offers insights that traditional bulk RNA sequencing cannot provide, particularly in developmental biology, tumor biology, immunology, and related fields. Downstream analysis typically centers on dimensionality reduction with t-SNE and on clustering, which together support exploration of the data. However, the success of the entire analysis depends heavily on careful quality control performed before these steps. This article provides an overview of quality control for single-cell RNA sequencing.

Several sources of bias and noise affect single-cell RNA sequencing, including:

Amplification bias: Certain mRNAs, including highly expressed ones, may be amplified unevenly, so read counts do not faithfully reflect their abundance.
Drop-out rates: Some mRNAs may fail to be captured or amplified, so they are missing from the data.
Transcriptional bursting: Transcription occurs in sporadic bursts, so a snapshot of a single cell may not reflect its average expression.
Background noise: Unwanted signals and technical noise can obscure the desired biological signals.
Bias from cell cycle and cell size: Variations in cell cycle stage and cell size can affect sequencing results.
Batch effect: Differences between experimental batches can introduce biases and hinder accurate comparisons.
Reproducibility across technical replicates: Correlation analysis of technical replicates of the same sample shows how reliable and reproducible the results are.

By understanding and addressing these biases, researchers can improve the reliability and validity of single-cell RNA sequencing studies.
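
As a minimal sketch of the replicate check mentioned above, the agreement between two technical replicates can be summarized with a Pearson correlation on log-transformed counts. The function name `replicate_correlation` and the toy counts are illustrative assumptions, not part of any particular pipeline.

```python
import numpy as np

def replicate_correlation(rep_a, rep_b):
    """Pearson correlation of log1p-transformed counts from two technical replicates."""
    a = np.log1p(np.asarray(rep_a, dtype=float))
    b = np.log1p(np.asarray(rep_b, dtype=float))
    return float(np.corrcoef(a, b)[0, 1])

# Made-up gene-level counts for the same sample processed twice.
rep1 = [0, 5, 12, 30, 100, 250]
rep2 = [1, 4, 15, 28, 90, 270]
r = replicate_correlation(rep1, rep2)
print(f"replicate correlation: {r:.3f}")
```

A correlation close to 1 suggests that technical variation is small relative to the expression differences between genes; what counts as acceptable depends on the protocol and should be judged against other replicates from the same experiment.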

Key Considerations for Cell Separation in Single-Cell Sequencing

Before single-cell sequencing, the cells must be separated effectively. If separation takes too long, cell integrity suffers and RNA may leak from the cells. Several factors should be kept in mind when isolating single cells from tissues:

Incomplete Cell Separation: It is possible for multiple cells to adhere together during the separation process.
Cell Damage and RNA Degradation: Inadequate cell separation conditions can harm the cells, leading to RNA degradation or leakage.
Background Signal from RNA Leakage: Leakage of RNA during cell separation can contribute to unwanted background signals.
Biased Cell Isolation: The cell isolation procedure may introduce bias, where specific cell types are preferentially isolated. Moreover, the process itself can induce changes in gene expression.

Therefore, when analyzing clustering results, it is important to check whether genes expressed specifically in particular cell groups could be artifacts of the cell separation procedure rather than genuine biology.

Cell Sorting

When it comes to cell sorting, we encounter several challenges, including:

Inconsistent cell distribution: Existing single-cell sequencing methods often produce empty droplets or wells, as well as droplets that contain more than one cell.
Cell size bias: Many single-cell assays favor specific cell sizes. For instance, techniques such as Drop-seq impose an upper limit on cell size.
Cell type bias: Sorting procedures often favor certain cell types over others.
Cell damage and background noise: Prolonged sorting experiments can damage cells and introduce background noise, which can affect the quality of the data obtained.

To address these challenges, different strategies for sequencing single cells have been developed. It is crucial to carefully select the appropriate single-cell strategy for studying specific tissues. Furthermore, low cell quality or the presence of dead cells or cellular debris can result in multiple cells being encapsulated within droplets. During the subsequent data analysis, these droplets may either form a distinct cluster or appear enriched between two cell groups.

To determine the presence of droplets containing multiple cells, the following criteria are typically used:

High molecular barcode count: An elevated count of molecular barcodes (unique molecular identifiers, UMIs) suggests that a droplet contains multiple cells.
Cells with features of multiple populations: Cells that express markers of more than one cell type may come from droplets containing multiple cells.
Expected doublet rate: In 10X single-cell RNA sequencing, the proportion of doublets can be estimated in advance, because it rises with the number of cells loaded.
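
The last criterion can be sketched as a simple linear estimate. The default rate used here (about 0.8% doublets per 1,000 recovered cells) is an assumed rule of thumb, not a value taken from this article, and should be replaced with the figure documented for the specific kit and chip. The function name `expected_doublet_rate` is likewise hypothetical.

```python
def expected_doublet_rate(n_cells_recovered, rate_per_1000=0.008):
    """Linear rule-of-thumb estimate of the doublet fraction.

    rate_per_1000 is an ASSUMED value (about 0.8% per 1,000 recovered cells);
    replace it with the figure from the documentation for your kit.
    """
    if n_cells_recovered < 0:
        raise ValueError("n_cells_recovered must be non-negative")
    return min(1.0, rate_per_1000 * n_cells_recovered / 1000.0)

for n in (1000, 5000, 10000):
    rate = expected_doublet_rate(n)
    print(f"{n} cells: about {rate:.1%} doublets, roughly {round(rate * n)} cells")
```

An estimate of this kind is mainly useful as a sanity check: if a doublet detection tool flags far more or far fewer cells than expected, its parameters deserve a second look.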

Currently, several software tools are available to assist in identifying doublets, such as:

DoubletFinder
Scrublet
DoubletDecon
doubletCluster/doubletCells in scran

These doublet detection algorithms exhibit similarities in their approach and follow a basic principle consisting of the following steps:

Random cell merging: The expression profiles of two randomly chosen cells are merged to simulate a doublet.
Dimensionality reduction and clustering: The real cells and the simulated doublets are projected into a reduced space together and clustered.
Removal of identified doublets: Real cells that cluster with the simulated doublets are flagged as doublets and removed from the analysis.
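
The three steps above can be sketched with NumPy alone. This is a simplified illustration of the shared idea, not the implementation used by DoubletFinder, Scrublet, or any other tool listed here; the function name `doublet_scores`, the nearest-neighbour scoring in place of clustering, and all default parameters are assumptions made for the example.

```python
import numpy as np

def doublet_scores(counts, n_sim=None, n_pcs=10, k=15, seed=0):
    """Fraction of simulated doublets among each real cell's k nearest neighbours.

    counts: (cells x genes) array of raw counts.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]
    if n_cells < 2:
        raise ValueError("need at least two cells to simulate doublets")
    n_sim = n_cells if n_sim is None else n_sim

    # Step 1: merge random pairs of distinct cells to simulate doublets.
    i = rng.integers(0, n_cells, size=n_sim)
    j = (i + rng.integers(1, n_cells, size=n_sim)) % n_cells
    simulated = counts[i] + counts[j]

    # Step 2: normalise, log-transform, and project real and simulated
    # cells onto shared principal components.
    combined = np.vstack([counts, simulated])
    totals = combined.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    logged = np.log1p(combined / totals * 1e4)
    centered = logged - logged.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    n_pcs = min(n_pcs, len(s))
    pcs = u[:, :n_pcs] * s[:n_pcs]

    # Step 3: score each real cell by how many simulated doublets sit nearby.
    is_sim = np.arange(combined.shape[0]) >= n_cells
    sq = (pcs ** 2).sum(axis=1)
    dist = sq[:n_cells, None] + sq[None, :] - 2.0 * pcs[:n_cells] @ pcs.T
    dist[np.arange(n_cells), np.arange(n_cells)] = np.inf  # ignore self
    k = min(k, combined.shape[0] - 1)
    nearest = np.argpartition(dist, k - 1, axis=1)[:, :k]
    return is_sim[nearest].mean(axis=1)

# Toy usage on random counts (no real doublets are present here).
toy = np.random.default_rng(1).poisson(2.0, size=(60, 40))
scores = doublet_scores(toy, k=10)
```

Cells with a high score lie in regions dominated by simulated doublets and are candidates for removal. The cutoff is a judgment call and is usually chosen from the score distribution together with the expected doublet rate.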

Cell Lysis

Cells must be lysed before single-cell sequencing. Suitable lysis conditions depend on the cell type or tissue under study. Overly harsh lysis conditions can compromise library preparation.

Reverse Transcription

The efficiency of reverse transcription is critical, because transcripts that are not reverse transcribed cannot be detected. The dropout rate typically ranges from 60% to 90%. Even when the same cell line is processed in the same manner, the dropout rate can vary significantly between two separately prepared libraries.
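
One rough way to compare libraries is the fraction of genes with zero counts in each cell. This is only a proxy, since a zero can reflect a gene that is truly not expressed as well as a technical dropout; the function name `zero_fraction_per_cell` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def zero_fraction_per_cell(counts):
    """Fraction of genes with zero counts in each cell (cells x genes input)."""
    counts = np.asarray(counts)
    return (counts == 0).mean(axis=1)

toy_counts = np.array([[0, 3, 0, 1, 0],
                       [2, 0, 0, 0, 0]])
zero_fractions = zero_fraction_per_cell(toy_counts)
print(zero_fractions)  # [0.6 0.8]
```

Comparing the distribution of this value between two libraries of the same cell line gives a quick indication of whether their capture efficiencies differ.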

Amplification Process

Every amplification step can introduce bias. Many single-cell transcriptome sequencing techniques attach molecular barcodes (unique molecular identifiers, UMIs) to transcripts before amplification, so that reads derived from the same original molecule can be collapsed and amplification bias corrected. However, full-length protocols such as Smart-seq2 lack molecular barcodes, so amplification bias cannot be corrected with barcode-based methods.
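
The correction works by counting distinct molecular barcodes rather than reads. The sketch below shows the idea on made-up data; the function name `count_umis` and the read tuples are hypothetical, and the sketch ignores sequencing errors within barcodes, which dedicated tools handle.

```python
from collections import defaultdict

def count_umis(reads):
    """Collapse reads into molecule counts.

    reads: iterable of (cell_barcode, gene, umi) tuples, one per aligned read.
    Returns {(cell_barcode, gene): number_of_distinct_umis}.
    """
    umis = defaultdict(set)
    for cell, gene, umi in reads:
        umis[(cell, gene)].add(umi)
    return {key: len(seen) for key, seen in umis.items()}

reads = [
    ("cellA", "GAPDH", "AACGT"),
    ("cellA", "GAPDH", "AACGT"),  # amplification duplicate of the read above
    ("cellA", "GAPDH", "AACGT"),  # another duplicate
    ("cellA", "GAPDH", "TTGCA"),
    ("cellA", "ACTB", "GGATC"),
]
umi_counts = count_umis(reads)
print(umi_counts)  # {('cellA', 'GAPDH'): 2, ('cellA', 'ACTB'): 1}
```

Four GAPDH reads collapse to two molecules, so a transcript that happened to amplify well no longer looks more abundant than it was.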

Library Preparation and Sequencing

Spike-in RNAs are a collection of RNA transcripts with known sequences that are added at known concentrations during library construction. Commonly used spike-in sets include:

ERCC: Comprises 92 RNAs derived from various bacteria, with different lengths and GC contents, added at 22 distinct concentrations.
SIRV: Consists of 69 synthetic transcripts designed to mimic human genes. They are used mainly to check whether sequencing results can resolve isoforms of human genes.

Applications of Spike-ins:

Technical Noise Removal: Spike-ins aid in eliminating technical noise present during library preparation and sequencing procedures.
Capture Efficiency Detection: They facilitate the assessment of capture efficiency, gauging how effectively target RNAs are captured.
Estimation of Starting RNA Amount: Because spike-ins are added at known concentrations, they help estimate the amount of RNA initially present in each cell.
Data Normalization: They enable the normalization of data, ensuring accurate comparisons across different samples.
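
Two of these applications reduce to simple arithmetic, sketched below under stated assumptions: the number of spike-in molecules added per cell is known, and the same amount is added to every cell. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def capture_efficiency(observed_spike_molecules, added_spike_molecules):
    """Fraction of the added spike-in molecules that was detected in each cell."""
    observed = np.asarray(observed_spike_molecules, dtype=float)
    return observed / float(added_spike_molecules)

def spike_in_size_factors(observed_spike_molecules):
    """Per-cell scaling factors from spike-in totals, scaled to a mean of 1."""
    observed = np.asarray(observed_spike_molecules, dtype=float)
    return observed / observed.mean()

# Made-up spike-in totals for three cells, with 10,000 molecules added to each.
spike_counts = np.array([500.0, 1000.0, 1500.0])
efficiency = capture_efficiency(spike_counts, added_spike_molecules=10000)
factors = spike_in_size_factors(spike_counts)
print(efficiency)  # [0.05 0.1  0.15]
print(factors)     # [0.5 1.  1.5]
```

Dividing each cell's endogenous counts by its factor puts cells on a common technical scale while leaving genuine differences in total RNA content intact.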

Limitations of Spike-ins:

Despite their utility, spike-ins still differ from endogenous genes, particularly in how they are amplified. This difference must be taken into account when interpreting the results. Furthermore, spike-ins are generally not used in droplet-based methods such as Drop-seq.

Quality Control Checkpoints of Single-cell RNA-Seq

Typically, the checkpoints for quality control (QC) include the following:

Unique Mapping Rate
Proportion of Reads Mapped to Exonic Regions
3' Bias in Single-Cell Full-Length Transcripts
Reads Mapped to mRNA
Molecular Barcode (UMI)/Reads Ratio
Number of Detected Genes
Detection of Spike-in RNA
Proportions of Mitochondrial and Ribosomal RNA

A low mapping rate or a low number of reads usually points to problems in library construction. In particular, a low number of reads may result from increased formation of primer dimers.

The absence of spike-in RNA sequences directly indicates failure in library construction. However, if the spike-in is normal and the cell exhibits a low number of RNA sequences, it could be due to the small size of the cell or damage to the cell before library construction.

The number of detected genes is directly linked to the size of the cell. If an excessive number of genes (or molecular barcodes) is detected, the droplet probably contains multiple cells, although it cannot be ruled out that the cell itself is simply very large. Cells with either too many or too few detected genes are therefore treated as abnormal.

Generally, cell size and the number of detected genes are positively correlated, while the spike-in RNA ratio tends to fall as cell size increases, because a fixed amount of spike-in makes up a smaller share of the library when more endogenous RNA is present. Elevated levels of mitochondrial RNA indicate a broken cell. When a cell breaks, cytoplasmic RNA is released, but mitochondrial RNA remains enclosed within the mitochondrial membrane. Therefore, when the cell membrane is damaged, the percentage of mitochondrial RNA rises. Note: This phenomenon can also occur during apoptosis or necrosis.

High levels of ribosomal RNA may indicate increased RNA degradation within the cell. In full-length single-cell transcriptomes, 3' bias in read coverage can be used to identify substantial RNA degradation.
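
Several of these checkpoints can be computed directly from a count matrix. The sketch below assumes a cells-by-genes matrix and human gene symbols in which mitochondrial genes start with "MT-"; other species or annotations use different naming, so the prefix is an assumption to adjust. The function name `per_cell_qc` is hypothetical.

```python
import numpy as np

def per_cell_qc(counts, gene_names, mito_prefix="MT-"):
    """Total counts, detected genes, and mitochondrial percentage per cell.

    mito_prefix is an ASSUMPTION about gene naming (human symbols such as MT-CO1).
    """
    counts = np.asarray(counts, dtype=float)
    is_mito = np.array([g.upper().startswith(mito_prefix) for g in gene_names])
    total = counts.sum(axis=1)
    n_genes = (counts > 0).sum(axis=1)
    safe_total = np.where(total > 0, total, 1.0)  # avoid division by zero
    pct_mito = 100.0 * counts[:, is_mito].sum(axis=1) / safe_total
    return {"total_counts": total, "n_genes": n_genes, "pct_mito": pct_mito}

genes = ["ACTB", "GAPDH", "MT-CO1", "MT-ND1"]
toy_matrix = np.array([[10, 5, 3, 2],
                       [1, 0, 6, 3]])
qc = per_cell_qc(toy_matrix, genes)
print(qc["pct_mito"])  # [25. 90.]
```

In this toy example the second cell has few counts and 90% mitochondrial reads, the pattern described above for a cell whose membrane was damaged before library construction.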

How to Filter Cells

Usually, most cells follow the same trend across the QC metrics, and cells that fail are removed by combining several metrics rather than relying on one. It is therefore advisable to inspect the distribution of each metric before deciding which cells to filter out.
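
One way to turn "look at the distribution first" into thresholds is to flag cells that lie several median absolute deviations (MADs) from the median of each metric, then combine the flags. This is one possible approach rather than a rule prescribed by this article; the function names, the choice of 3 MADs, and the toy values are assumptions.

```python
import numpy as np

def mad_outliers(values, n_mads=3.0, direction="both"):
    """Flag values more than n_mads MADs from the median.

    If the MAD is zero, every value that differs from the median is flagged.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    lower = median - n_mads * mad
    upper = median + n_mads * mad
    if direction == "lower":
        return values < lower
    if direction == "upper":
        return values > upper
    return (values < lower) | (values > upper)

def filter_cells(total_counts, n_genes, pct_mito, n_mads=3.0):
    """Return a boolean mask of cells to keep, combining three metrics."""
    drop = (
        mad_outliers(np.log1p(total_counts), n_mads, "both")
        | mad_outliers(np.log1p(n_genes), n_mads, "both")
        | mad_outliers(pct_mito, n_mads, "upper")
    )
    return ~drop

# Made-up metrics: cell 5 has very few counts, cell 6 is mostly mitochondrial.
total_counts = [1000, 1100, 950, 1050, 20, 1020]
n_genes = [500, 520, 480, 510, 15, 505]
pct_mito = [5, 6, 4, 5, 5, 60]
keep = filter_cells(total_counts, n_genes, pct_mito)
print(keep)  # [ True  True  True  True False False]
```

Thresholds derived from the data adapt to each experiment, but they assume that most cells are of good quality, so the distributions should still be inspected by eye.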

PCA can also be used for QC, to find cells that clearly do not group with the other cells. Such outlying cells are considered not to meet the quality control standards.
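
A minimal sketch of this idea runs PCA on a matrix of per-cell QC metrics and flags cells that lie unusually far from the bulk. The function name `pca_outliers`, the use of two components, the 5-MAD cutoff, and the simulated metrics are all assumptions for illustration.

```python
import numpy as np

def pca_outliers(qc_matrix, n_pcs=2, n_mads=5.0):
    """Flag cells far from the bulk in the PC space of standardised QC metrics.

    qc_matrix: (cells x metrics) array, for example log counts, log genes, pct mito.
    """
    x = np.asarray(qc_matrix, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant metrics unscaled
    z = (x - x.mean(axis=0)) / std
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    n_pcs = min(n_pcs, len(s))
    pcs = u[:, :n_pcs] * s[:n_pcs]
    # Distance of each cell from the per-component median, a robust centre.
    dist = np.sqrt(((pcs - np.median(pcs, axis=0)) ** 2).sum(axis=1))
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    return dist > med + n_mads * mad

# 50 simulated ordinary cells plus one cell with low counts and high mito share.
rng = np.random.default_rng(0)
ordinary = rng.normal(loc=[7.0, 6.0, 5.0], scale=[0.1, 0.1, 1.0], size=(50, 3))
qc_matrix = np.vstack([ordinary, [3.0, 2.5, 60.0]])
flags = pca_outliers(qc_matrix)
```

Because PCA combines all metrics at once, it can catch cells that are only mildly unusual on each individual metric but clearly unusual overall.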

How to Filter Genes

The next step is gene filtering. In the vast majority of cases, not all genes are used for dimensionality reduction, so a gene set must be selected.

The gene set is typically selected based on:

Genes with expression above a certain threshold
Genes with high variability across cells
Genes chosen from prior knowledge
Differentially expressed genes that have been identified in bulk RNA sequencing

After PCA on the selected genes, only the first few principal components (PCs) are used as input for t-SNE dimensionality reduction.
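
The first two criteria and the PC selection can be sketched as follows. Ranking genes by the plain variance of log-normalised expression is a simplification chosen for brevity (dedicated tools model the relationship between mean and variance); the function name `select_genes_and_pcs` and all defaults are assumptions.

```python
import numpy as np

def select_genes_and_pcs(counts, min_mean=0.1, n_top_genes=2000, n_pcs=10):
    """Pick expressed, variable genes and return their indices plus the top PCs."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    logged = np.log1p(counts / totals * 1e4)

    # 1) Keep genes whose mean count passes the expression threshold.
    expressed = counts.mean(axis=0) >= min_mean
    n_keep = min(n_top_genes, int(expressed.sum()))
    if n_keep == 0:
        raise ValueError("no gene passes the expression threshold")

    # 2) Among those, keep the genes that vary most across cells.
    variances = np.where(expressed, logged.var(axis=0), -np.inf)
    selected = np.sort(np.argsort(variances)[::-1][:n_keep])

    # 3) Run PCA on the selected genes and keep only the first few PCs.
    x = logged[:, selected]
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    n_pcs = min(n_pcs, len(s))
    return selected, u[:, :n_pcs] * s[:n_pcs]

toy_counts = np.random.default_rng(0).poisson(1.0, size=(30, 50))
selected, pcs = select_genes_and_pcs(toy_counts, n_top_genes=20, n_pcs=5)
```

The returned PC matrix could then be passed to a t-SNE implementation, for example the `TSNE` class in scikit-learn if that library is available.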

How to Address Batch Effects

One of the most challenging issues in single-cell RNA sequencing revolves around batch effects. Batch effects can manifest in various scenarios, such as:

Distinct experiments conducted on diverse animals, patients, or cells.
Varied sequencing lanes employed during the experiments.

To mitigate batch effects, it is essential to establish separate quality control standards for different sample batches. One approach is to run principal component analysis (PCA) and check whether cells separate by batch along the leading components.
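
The PCA check can be made quantitative by asking how much of each leading component is explained by batch membership. The sketch below computes that share as the between-batch fraction of each component's variance; the function name `batch_variance_per_pc` and the simulated data with a deliberate batch shift are assumptions for illustration.

```python
import numpy as np

def batch_variance_per_pc(expr, batch_labels, n_pcs=5):
    """Share of each top PC's variance that is explained by batch (0 to 1).

    expr: (cells x genes) normalised expression; batch_labels: one label per cell.
    """
    x = np.asarray(expr, dtype=float)
    labels = np.asarray(batch_labels)
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    n_pcs = min(n_pcs, len(s))
    pcs = u[:, :n_pcs] * s[:n_pcs]
    r2 = np.zeros(n_pcs)
    for p in range(n_pcs):
        scores = pcs[:, p]
        total_ss = ((scores - scores.mean()) ** 2).sum()
        between_ss = sum(
            (labels == b).sum() * (scores[labels == b].mean() - scores.mean()) ** 2
            for b in np.unique(labels)
        )
        r2[p] = between_ss / total_ss if total_ss > 0 else 0.0
    return r2

# Simulated data: 40 cells, 20 genes, second batch shifted upward on every gene.
rng = np.random.default_rng(0)
expr = rng.normal(size=(40, 20))
expr[20:] += 3.0
batches = np.array([0] * 20 + [1] * 20)
r2 = batch_variance_per_pc(expr, batches)
print(np.round(r2, 2))
```

A value near 1 for the first component, as in this simulation, means that the dominant axis of variation is the batch rather than biology, which is a signal to apply batch correction before clustering.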
