Illumina sequencing by synthesis (SBS) has become the dominant platform in genomics not because it is simple, but because its underlying engineering—from surface chemistry to optical detection—has been refined through two decades of continuous innovation. Understanding what happens inside the sequencing instrument is essential for designing robust experiments, troubleshooting failed runs, and interpreting data quality metrics.
This article focuses specifically on the physical and chemical processes that occur inside the Illumina flow cell and imaging system. It does not cover library preparation protocols (covered in separate resources) or bioinformatics analysis. Instead, it provides a detailed look at the molecular interface between the library and the flow cell, the amplification chemistry that turns single molecules into detectable clusters, the sequencing-by-synthesis cycle, the optical system that converts fluorescence into base calls, and the factors that determine final data quality.
Throughout each section, we link the scientific principles directly to practical outcomes: how understanding these mechanisms helps explain common sequencing problems—from low cluster density and poor Q-scores to unexpected read duplication—and how they inform better experimental design and troubleshooting.
The Illumina Sequencing Ecosystem in 2026
The Illumina sequencing workflow follows a four-stage pipeline: sample preparation, library construction, cluster generation and sequencing, and data analysis. This article focuses on stage three—the core sequencing process—which encompasses the chemical, optical, and fluidic operations inside the instrument.
The sequencing instrument is not a black box. Every stage of the process—from the moment a library molecule first contacts the flow cell surface to the final base call in a FASTQ file—involves well-characterized physical and chemical mechanisms. Understanding these mechanisms allows researchers to predict how changes in library quality, loading concentration, or read length will affect data output, rather than discovering the consequences after a failed run.
The quality of the sequencing data ultimately depends on the quality of the input library. Rigorous library QC and data analysis services are essential for translating raw sequencing output into reliable biological conclusions.
Why this understanding matters for experimental design: A researcher who understands phasing physics will not be surprised when a 2 × 300 bp run produces lower Q-scores than a 2 × 150 bp run. A lab manager who knows how ExAmp technology works can make an informed decision about whether to use a patterned or non-patterned flow cell for a specific project. And a bioinformatics specialist who understands 2-channel vs. 4-channel imaging can predict which types of libraries will perform well on which platform without running a pilot test. For labs preparing their own samples, reliable NGS services offer the platform expertise needed to match experimental goals with optimal sequencing configurations.
A brief note on read types: Single-read (SR) sequencing reads one end of each fragment. Paired-end (PE) sequencing reads both ends, producing two reads per fragment. Paired-end reads provide more accurate alignment to reference genomes, enable detection of structural variants, and improve de novo assembly quality. The majority of modern NGS applications use paired-end sequencing, with read lengths adapted to the specific application—2 × 150 bp for most whole genome sequencing and RNA-seq, 2 × 250 bp or 2 × 300 bp for 16S/ITS amplicon sequencing where full-length variable region coverage is needed.

Figure 1. NGS pipeline overview — from sample to data, highlighting the scope of this article
Caption: Four-stage Illumina NGS pipeline showing sample preparation, library construction, cluster generation and sequencing (the focus of this article), and data analysis, with emphasis on the core sequencing process inside the instrument.
Step 1 — The Molecular Interface: How Libraries Attach to the Flow Cell
Before any sequencing can occur, library molecules must be physically anchored to the flow cell surface. This occurs through sequence-specific hybridization between the adapter sequences on each library molecule (P5 and P7) and complementary oligonucleotides covalently attached to the flow cell surface—a dense “lawn” of capture probes. The density of this lawn is approximately 10⁶-10⁷ probes per µm², far exceeding the number of library molecules that can be loaded, ensuring that capture efficiency is limited by library availability rather than probe density.
The hybridization efficiency determines what fraction of loaded library molecules successfully anchor and go on to form clusters. Several factors influence this efficiency:
- Adapter sequence integrity: Truncated or damaged adapter sequences cannot hybridize effectively. This is a hidden failure mode—library quantification by qPCR may indicate sufficient material, but if a significant fraction of adapter sequences are damaged, effective cluster density will be lower than expected. The most common cause of adapter damage is excessive freeze-thaw cycling of the library or exposure to nucleases during storage.
- Salt concentration during loading: The hybridization buffer’s ionic strength directly affects the melting temperature of the adapter-capture duplex. Suboptimal salt conditions can reduce hybridization efficiency by 30-50%, which is why Illumina loading buffers are formulated with precise salt concentrations optimized for the specific flow cell type.
- Competition from adapter dimers: Adapter dimer molecules carry the same P5/P7 sequences as intact library molecules and compete for capture sites on the flow cell surface. A library with 5% adapter dimer content loses at least 5% of flow cell capture capacity to non-informative clusters that produce short, unalignable reads, and often more, because short dimers tend to cluster more efficiently than full-length fragments.
- Loading time and temperature: The hybridization step requires sufficient time for library molecules to diffuse to the surface and find complementary probes. Loading at elevated temperatures can increase hybridization stringency but may denature some library molecules. The standard loading protocol balances these factors to achieve maximum on-target anchoring.
For researchers preparing their own libraries, the key takeaway is that library quality—not just quantity—directly determines the number of usable clusters that form on the flow cell. Comprehensive NGS sequencing services include rigorous library QC that verifies adapter integrity and quantifies dimer content before loading.

Figure 2. Flow cell surface hybridization — P5/P7 adapter anchoring to the oligonucleotide lawn
Caption: Molecular diagram showing library adapter (P5/P7) hybridization to complementary capture probes on the flow cell surface, with factors affecting hybridization efficiency including adapter integrity, salt concentration, and adapter dimer competition.
Step 2 — Cluster Generation: From Single Molecules to Detectable Signals
Single fluorophore molecules are undetectable by the sequencing instrument’s imaging system. Cluster generation solves this problem by amplifying each anchored library molecule into a clonal cluster of approximately 1,000 identical copies, producing a signal strong enough for reliable detection.
Bridge amplification: After adapter hybridization, the anchored single-stranded template folds over and hybridizes to a second complementary oligonucleotide on the flow cell surface, forming a “bridge.” A polymerase extends this bridge, creating a double-stranded copy. Denaturation releases the two strands, and the process repeats. After 35 cycles of bridge amplification, a single molecule has produced a cluster of approximately 1,000 strands. The critical parameter during bridge amplification is control of the reaction conditions. The reaction is generally described as isothermal, with each cycle’s denaturation driven chemically rather than by heating, so the temperature must stay stable through every annealing, extension, and denaturation step; deviations of even 1-2 °C can shift the efficiency by 10-20%.
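The figures above imply that bridge amplification is far from a perfect doubling reaction. The following is a minimal sketch, assuming a constant per-cycle efficiency (a simplification, since real efficiency probably falls as the cluster crowds its surroundings), that back-calculates the efficiency implied by roughly 1,000 copies after 35 cycles. The function names are illustrative.

```python
def copies_after(cycles: int, efficiency: float) -> float:
    """Copies from one template if each cycle multiplies the population by (1 + efficiency)."""
    return (1.0 + efficiency) ** cycles


def implied_efficiency(final_copies: float, cycles: int) -> float:
    """Constant per-cycle efficiency needed to reach final_copies from a single template."""
    return final_copies ** (1.0 / cycles) - 1.0


if __name__ == "__main__":
    eff = implied_efficiency(1_000, 35)
    print(f"Perfect doubling for 35 cycles: {copies_after(35, 1.0):.2e} copies")
    print(f"Efficiency implied by ~1,000 copies: {eff:.1%} per cycle")
```

Under this assumption the implied efficiency is about 22% per cycle, which illustrates how strongly surface crowding and steric constraints limit amplification compared with solution-phase PCR.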
Patterned vs. non-patterned flow cells: This is one of the most consequential design differences across Illumina platforms, and choosing between them directly affects data quality and project cost.
- Non-patterned flow cells (MiSeq, NextSeq 500/550, HiSeq 2500): Capture oligonucleotides are randomly distributed across the surface. Clusters form in random locations and may overlap, causing signal interference. Cluster density must be carefully controlled: too high, and clusters overlap; too low, and data output is suboptimal.
- Patterned flow cells (HiSeq 3000/4000/X, NovaSeq 6000, NovaSeq X, NextSeq 1000/2000): The surface contains pre-etched nanowells at fixed spacing (approximately 1 µm apart). Each nanowell can accommodate only one cluster before physical exclusion prevents a second molecule from occupying the same well.
ExAmp (exclusion amplification): This is the technology that makes patterned flow cells work. During amplification in a nanowell, the growing cluster of amplified DNA physically fills the well volume. Once a well is occupied, this exclusion effect prevents a second template molecule from initiating amplification in the same nanowell. ExAmp is designed so that each nanowell produces a monoclonal cluster, avoiding the cluster overlap and signal interference that reduce data quality on non-patterned platforms.
ExAmp has a subtle but important secondary benefit: it reduces the impact of loading concentration variability. On a non-patterned flow cell, the relationship between library loading concentration and effective cluster density is highly sensitive—a 10% change in loading concentration can shift cluster density by 20-30%. On a patterned flow cell with ExAmp, the nanowell occupancy follows a Poisson distribution that is more predictable, meaning loading concentration has a broader tolerance window. This makes patterned flow cells more forgiving of quantification errors.
The practical consequence: patterned flow cells with ExAmp technology produce higher effective cluster densities (250-350 K clusters per mm²) with lower duplication rates than non-patterned surfaces, even with the same loading concentration. For a project requiring 1 billion reads, a patterned flow cell can deliver the required data output with less library material and fewer reruns.

Figure 3. Patterned vs. non-patterned flow cell comparison — cluster arrangement and signal isolation
Caption: Comparative diagram of patterned (nanowell-based) and non-patterned (random-distribution) flow cell surfaces showing cluster arrangement, signal isolation differences, and the ExAmp exclusion amplification principle.
Step 3 — Sequencing by Synthesis: The Four-Stroke Chemical Engine
Once clusters are formed, the instrument begins the sequencing run. Each base is read through a repeating four-step cycle. The precision required at each step is extreme—the imaging system must detect fluorescence signals from clusters that are less than 1 µm apart, and the fluidics system must deliver and remove reagents with sub-milliliter accuracy across a flow cell with microchannel geometry.
- Incorporation: A DNA polymerase adds a single fluorescently labeled, reversibly-terminated nucleotide to the growing strand. Because the nucleotide is chemically blocked at the 3′ position, only one base can be added per cycle.
- Imaging: The instrument’s optical system excites the fluorescent label at specific wavelengths and captures images through 2 or 4 color channels. The intensity and wavelength of the signal at each cluster location determine the base identity.
- Cleavage: The fluorescent dye and the 3′ blocking group are simultaneously removed, restoring the strand’s ability to extend in the next cycle.
- Wash: Cleaved dye fragments and unincorporated reagents are flushed from the flow cell before the next cycle begins.
The time per cycle varies by platform and chemistry version. On the NovaSeq 6000 with standard SBS chemistry, each cycle takes approximately 5-10 minutes. The NovaSeq X with XLEAP-SBS chemistry reduces cycle time to 3-5 minutes through faster enzyme kinetics and a redesigned imaging system that captures the full flow cell in fewer exposures. The signal intensity in XLEAP-SBS is also 30-40% higher than standard SBS, reducing base-calling error rates. For projects requiring the highest throughput with this chemistry, large-scale WGS services can leverage the NovaSeq X’s full capacity.

Figure 4. The SBS four-step cycle — incorporation, imaging, cleavage, and wash
Caption: Detailed schematic of the sequencing-by-synthesis cycle showing four sequential steps—fluorescent nucleotide incorporation, multi-channel imaging, dye and blocking group cleavage, and wash—with phasing and pre-phasing effects noted.
The Physics of Phasing — Why Read Quality Declines Over Cycles
Every Illumina sequencing run shows a characteristic decline in quality scores from the first cycle to the last. This decline is not an instrument defect—it is a predictable consequence of imperfect molecular synchrony during the SBS cycle.
Phasing (lagging): In each cycle, a small fraction of templates in a cluster fail to incorporate a nucleotide. These templates fall one base behind the majority. The fraction is small per cycle (typically 0.1-0.5%), but the effect is cumulative: at the low end of that range, roughly 10% of the molecules in a cluster have slipped at least once after 100 cycles, and higher rates desynchronize the cluster faster. Commonly cited causes are incomplete removal of the 3′ block from the previous cycle and incomplete washing between cycles, either of which can leave a strand unable to accept the current cycle’s nucleotide.
Pre-phasing (leading): A smaller fraction incorporate two bases in a single cycle, either because the blocking group was incompletely attached during manufacturing or because the cleavage step from the previous cycle left some 3′ ends unblocked. Pre-phasing rates of 0.05-0.2% per cycle are normal and account for roughly one-third of the total synchrony loss.
Cumulative signal degradation: As phasing and pre-phasing accumulate, the fluorescence signal from a cluster becomes a mixture of bases from different positions. The instrument’s base-calling software estimates and corrects for phasing using mathematical models, but the correction becomes less effective as the out-of-phase fraction grows. This sets a practical limit on read length.
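The cumulative effect described above can be illustrated with a deliberately simple model. The sketch below assumes every strand slips independently at fixed per-cycle rates and that a strand which has slipped once stays out of phase; real base callers fit more detailed models, and the rates used here are simply picked from the ranges quoted above.

```python
def in_phase_fraction(cycles: int, phasing: float, prephasing: float = 0.0) -> float:
    """Fraction of strands that have never slipped after `cycles` cycles.

    Simplification: each strand independently lags with probability `phasing`
    or leads with probability `prephasing` in every cycle, and a strand that
    has slipped once is counted as out of phase for the rest of the read.
    """
    if not 0.0 <= phasing + prephasing <= 1.0:
        raise ValueError("per-cycle rates must sum to a value between 0 and 1")
    return (1.0 - phasing - prephasing) ** cycles


if __name__ == "__main__":
    for n in (50, 100, 150, 300):
        frac = in_phase_fraction(n, phasing=0.001, prephasing=0.0005)
        print(f"cycle {n:3d}: {1.0 - frac:.1%} of strands out of phase")
```

Because the in-phase fraction decays geometrically, doubling the read length more than doubles the quality penalty in the final cycles, which is why long reads are disproportionately affected.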
Practical implications: For a 2 × 150 bp run, phasing effects are manageable and >85% Q30 is typical. For a 2 × 300 bp run, phasing accumulation reduces the usable fraction to 75-80% Q30. XLEAP-SBS chemistry reduces phasing rates through faster enzyme kinetics and more efficient washing, extending the practical read length at high quality. Researchers planning experiments that require long reads can combine this insight with targeted sequencing approaches to optimize coverage for specific genomic regions.

Figure 5. Phasing accumulation curve — cluster synchrony decay from cycle 1 to cycle 300
Caption: Phasing accumulation curve showing the progressive loss of cluster synchrony from cycle 1 to cycle 300, with phasing (lagging) and pre-phasing (leading) effects contributing to cumulative signal degradation and Q-score decline.
Step 4 — The Imaging System: From Photons to Base Calls
The fluorescence signals from each cycle must be converted into base calls. The imaging architecture determines the accuracy and speed of this conversion.
The imaging system is arguably the most engineered component of any Illumina sequencing platform. The transition from 4-channel to 2-channel imaging was not merely a cost-saving measure—it represented a fundamental rethinking of the signal detection strategy.
4-channel imaging (standard): Each of the four nucleotides (A, C, G, T) is labeled with a distinct fluorophore, and the instrument images at four separate wavelengths. Each base occupies one channel. This produces the highest degree of spectral discrimination and is the most accurate approach. However, four images must be captured per cycle, limiting imaging speed. Used on MiSeq and some HiSeq platforms.
2-channel imaging (efficiency): Instead of four distinct fluorophores, the 2-channel system uses two dyes and a combinatorial logic approach. A mix of red and green fluorescence produces four possible states:
| State | Red Channel | Green Channel | Called Base |
|---|---|---|---|
| No signal | Off | Off | G |
| Red only | On | Off | C |
| Green only | Off | On | T |
| Both | On | On | A |
2-channel imaging is faster (two exposures per cycle instead of four), enabling higher throughput on platforms like NovaSeq and NextSeq. The trade-off is that the “no signal” state (G) provides no positive fluorescence—meaning a failed cluster that produces no signal in either channel is indistinguishable from a G base. This is managed through signal normalization using high-quality sequencing data, typically from a PhiX control spike-in, which calibrates the expected signal ratios for each base. For projects multiplexing many samples, robust genotyping services rely on accurate base calling across all imaging channels.
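The combinatorial logic can be sketched in a few lines. The example assumes the red/green assignment documented for Illumina's red/green 2-channel chemistry (A in both channels, C in red only, T in green only, G in neither) and a fixed on/off threshold on normalized intensities; actual base callers work on calibrated intensity clouds rather than a hard cutoff, and the function names are illustrative.

```python
# Assumed red/green assignment: A = both, C = red only, T = green only, G = neither.
BASE_BY_STATE = {
    (True, True): "A",
    (True, False): "C",
    (False, True): "T",
    (False, False): "G",
}


def call_base(red: float, green: float, threshold: float = 0.5) -> str:
    """Call a base from normalized red/green intensities using a fixed on/off threshold."""
    return BASE_BY_STATE[(red >= threshold, green >= threshold)]


def call_read(intensities) -> str:
    """Call one base per cycle from a sequence of (red, green) intensity pairs."""
    return "".join(call_base(red, green) for red, green in intensities)


if __name__ == "__main__":
    print(call_read([(0.9, 0.8), (0.9, 0.1), (0.05, 0.95), (0.02, 0.03)]))
    # A cluster that emits nothing is called as a run of G.
    print(call_read([(0.0, 0.0)] * 5))
```

The second call shows the trade-off in practice: a dark cluster is decoded as poly-G, which is why runs of G at the end of reads on 2-channel instruments are treated with suspicion.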
Imaging system evolution across platforms: The NovaSeq 6000 uses a 2-channel CMOS (complementary metal-oxide-semiconductor) imaging system with two-color laser excitation. The NovaSeq X upgrades to a higher-resolution CMOS sensor with improved quantum efficiency, capturing more photons per cluster per cycle. This contributes to the higher Q-scores observed on NovaSeq X runs despite the shorter 2-channel cycle time. The MiSeq and older HiSeq platforms use 4-channel CCD (charge-coupled device) sensors, which provide higher sensitivity at the cost of slower readout speed.
For low-diversity libraries such as amplicon panels, the limited nucleotide diversity means fewer clusters contribute to the calibration, making 2-channel platforms more sensitive to diversity-related quality issues than 4-channel instruments.

Figure 6. 2-channel versus 4-channel imaging — base calling logic and signal combination
Caption: Comparison of 4-channel and 2-channel imaging architectures showing the fluorophore assignment per base, the combinatorial logic of 2-channel red/green signal states for base calling, and platform-specific implementations (CCD vs CMOS sensors).
Understanding Q-Scores — The Currency of Sequencing Quality
The Phred quality score (Q) is the standard metric for sequencing accuracy. It is defined as Q = −10 log₁₀(P), where P is the probability of an incorrect base call. A Q30 score corresponds to an error probability of 1 in 1,000 (99.9% accuracy); Q20 corresponds to 1 in 100 (99% accuracy).
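The Phred relationship is easy to apply in both directions. The helper names below are illustrative, and the expected-error calculation assumes per-base errors are independent.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p: float) -> float:
    """Phred score Q for an error probability P (0 < P <= 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def expected_errors(qualities) -> float:
    """Expected number of miscalled bases in a read, given its per-base Phred scores."""
    return sum(phred_to_error_prob(q) for q in qualities)


if __name__ == "__main__":
    for q in (20, 30, 40):
        print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4%}")
    print(f"Expected errors in a 150 bp read at Q30: {expected_errors([30] * 150):.2f}")
```

Because the scale is logarithmic, a 10-point drop in Q means ten times as many expected errors, so a read that is uniformly Q30 carries about 0.15 expected errors over 150 bases while the same read at Q20 carries about 1.5.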
For a typical Illumina run:
- >85% of bases at Q30 or higher for 2 × 150 bp runs
- >75% of bases at Q30 for 2 × 250 bp or longer runs
How to read a sequencing QC report: The standard Illumina analysis viewer displays several key metrics that should be reviewed after every run. The per-cycle quality heatmap shows the distribution of quality scores across all cycles—a gradual decline from left to right is normal, while a sharp drop at any point indicates a transient issue during the run. The base composition plot should show balanced A/T and G/C curves for diverse libraries. The GC content distribution should match the expected values for the sequenced genome or transcriptome. The duplication rate should be below 15% for most library types; higher values suggest low input DNA or excessive PCR cycles.
Why low-diversity libraries need PhiX spike-in: The instrument’s base-calling software uses the first few cycles of sequencing to calibrate the relative signal intensities for each base. If the library has low nucleotide diversity—for example, all clusters showing the same base at the same position—the calibration cannot distinguish true signal from systematic bias. Adding PhiX control library (typically 5-20% of total library mass) provides a reference with balanced base composition, enabling correct signal normalization. Without PhiX, a low-diversity library run on a 2-channel platform may produce uniformly low Q-scores across all cycles. The required PhiX proportion increases as library diversity decreases: for a 16S library with 50% GC content, 5-10% PhiX may be sufficient; for a highly uniform amplicon panel, 15-20% may be needed.

Figure 7. Typical Q-score distribution across a 2 × 150 bp Illumina run
Caption: Per-cycle Q-score distribution heatmap showing the typical quality decline from high Q-scores in early cycles to lower scores in later cycles due to cumulative phasing effects, with Q30 and Q20 thresholds marked.
Factors Affecting Data Quality — A Deeper Diagnostic Look
Sample loading density: The relationship between loading concentration and cluster density is not linear and follows a predictable saturation curve. At low loading concentrations, each additional library molecule has a high probability of finding an unoccupied capture site, and cluster density increases nearly linearly with loading. As concentration increases, capture sites become occupied and new molecules compete for diminishing available space. On patterned flow cells, the saturation point is determined by the number of nanowells, and the loading curve follows an exponential approach to the maximum well occupancy. Optimal cluster density is typically 250-350 K per mm² for patterned flow cells and 150-250 K per mm² for non-patterned surfaces. Below these ranges, data output is lower than the flow cell’s capacity; above them, clusters begin to overlap in non-patterned surfaces or become difficult to resolve in patterned surfaces, lowering Q-scores and reducing usable reads.
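The saturation behaviour on patterned flow cells can be sketched with a Poisson seeding model. This assumes templates land in nanowells independently, with `mean_templates_per_well` as a hypothetical loading parameter; it ignores exclusion kinetics and any instrument-specific behaviour.

```python
import math


def occupied_fraction(mean_templates_per_well: float) -> float:
    """Fraction of nanowells seeded by at least one template under Poisson loading."""
    if mean_templates_per_well < 0:
        raise ValueError("mean templates per well cannot be negative")
    return 1.0 - math.exp(-mean_templates_per_well)


def multiply_seeded_fraction(mean_templates_per_well: float) -> float:
    """Fraction of nanowells that receive two or more templates under Poisson loading."""
    lam = mean_templates_per_well
    return 1.0 - math.exp(-lam) * (1.0 + lam)


if __name__ == "__main__":
    for lam in (0.5, 1.0, 2.0, 3.0):
        print(
            f"mean {lam:.1f} templates/well: "
            f"{occupied_fraction(lam):.1%} occupied, "
            f"{multiply_seeded_fraction(lam):.1%} seeded more than once"
        )
```

In this model, occupancy rises quickly at low loading and then flattens, while the share of wells that receive more than one template keeps growing. Those multiply seeded wells are the ones where exclusion during amplification has to do its work.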
GC content bias: GC-rich regions form stable secondary structures that reduce amplification efficiency during both library preparation and cluster generation. This produces lower coverage in GC-rich areas of the genome. The effect is more pronounced in PCR-based library preparation methods and less severe in PCR-free protocols, but it cannot be completely eliminated because cluster generation itself involves an amplification step. For regions with >70% GC content, coverage may drop to 50-60% of the genome average. Researchers studying GC-rich regions should factor this into their coverage planning by sequencing to higher raw depth.
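The coverage-planning adjustment is simple arithmetic. A minimal sketch, assuming the relative coverage of the region of interest (its depth as a fraction of the genome average) is already known from pilot data or a previous run; the function name is illustrative.

```python
def raw_depth_needed(target_depth: float, relative_coverage: float) -> float:
    """Genome-average depth required so a region with reduced relative coverage reaches target_depth."""
    if not 0.0 < relative_coverage <= 1.0:
        raise ValueError("relative_coverage must be in (0, 1]")
    return target_depth / relative_coverage


if __name__ == "__main__":
    for rel in (0.6, 0.5):
        print(f"relative coverage {rel:.0%}: sequence to {raw_depth_needed(30, rel):.0f}x for 30x in-region")
```

For a region that receives only half the average coverage, reaching 30× locally means sequencing the whole genome to about 60×.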
Salt and solvent carryover: Residual salts, EDTA, or organic solvents from sample preparation inhibit the polymerase used during SBS. This produces uniformly low signal intensities from the first cycle onward, a pattern distinct from phasing-related quality decline, which starts high and degrades gradually. Libraries with high salt carryover will show low Q-scores even in early cycles. The typical diagnostic: if Q-scores are uniformly low across all cycles (no downward slope), the cause is more likely carryover or contamination in the sample or reagents than phasing.
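The distinction between a sloping and a flat quality profile can be turned into a rough screening rule. The sketch below is a heuristic only: the two thresholds (`low_start`, `min_drop`) are hypothetical values chosen for illustration and are not Illumina specifications, and the input is assumed to be the mean Q-score per cycle taken from a run report.

```python
def diagnose_quality_profile(mean_q_per_cycle, low_start=25.0, min_drop=3.0):
    """Classify a per-cycle mean Q-score profile with two illustrative thresholds.

    low_start: early-cycle mean Q below which the run is treated as low from the start.
    min_drop:  early-to-late decline (in Q units) treated as a real downward slope.
    """
    scores = list(mean_q_per_cycle)
    if len(scores) < 10:
        raise ValueError("need at least 10 cycles to compare early and late quality")
    window = max(1, len(scores) // 10)  # compare the first and last 10% of cycles
    early = sum(scores[:window]) / window
    late = sum(scores[-window:]) / window
    if early < low_start:
        return "low from cycle 1: suspect carryover or contamination"
    if early - late >= min_drop:
        return "gradual decline: consistent with phasing"
    return "flat and high: no obvious problem"


if __name__ == "__main__":
    declining = [38 - 0.05 * cycle for cycle in range(150)]
    print(diagnose_quality_profile(declining))
    print(diagnose_quality_profile([20.0] * 150))
```

A rule like this is only a first filter; a sharp mid-run drop, for example, fits neither pattern and needs to be checked against the instrument's run logs.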
From Raw Data to Usable Sequence — The BCL to FASTQ Pipeline
The sequencing instrument converts fluorescence images into base calls through a multi-step process:
- Image analysis: The instrument identifies the location of each cluster on the flow cell surface and measures the fluorescence intensity at each location for each imaging channel. For patterned flow cells, this process is simpler because cluster locations are predetermined by the nanowells. For non-patterned flow cells, the software must identify clusters de novo by detecting local intensity maxima in the first few cycles.
- Base calling: The intensity values are converted into base calls with associated quality scores. Modern instruments perform this in real time using onboard processing, enabling researchers to monitor data quality during the run rather than waiting for post-run analysis.
- BCL to FASTQ conversion: The instrument’s binary base call (BCL) files are demultiplexed by index sequence and converted into FASTQ format. This step is typically performed after the run completes, using Illumina’s bcl2fastq, its successor BCL Convert, or the DRAGEN analysis pipeline. The output is one FASTQ file per sample per read direction (R1 and R2 for paired-end runs).
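The end product of these steps is a plain-text format that is straightforward to inspect. The sketch below reads four-line FASTQ records and decodes the quality string, assuming the Phred+33 encoding used by current Illumina output; the record in the example is made up, and production pipelines should rely on an established parser rather than this minimal reader.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of stream
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Phred+33: each quality character's ASCII code minus 33 is the Q-score.
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]


# Made-up record for illustration; real Illumina headers carry instrument, lane, and tile fields.
example = io.StringIO("@read1\nACGT\n+\nIIF#\n")
for read_id, seq, quals in read_fastq(example):
    print(read_id, seq, quals)
```

Each quality character maps to one base call, so the per-cycle quality plots in a QC report are simply these decoded values aggregated by position across all reads.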
A typical 30× human WGS run produces approximately 90-100 Gb of raw data, translating to roughly 200 GB of FASTQ files per sample (100 GB for R1 and 100 GB for R2). The quality of this data is determined by every step in the sequencing process—from library quality through cluster density, phasing control, and base-calling accuracy. A single poorly prepared library can reduce the effective data output of an entire lane or flow cell by 30-50%.
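The raw-yield figure follows directly from genome size and target coverage. The sketch below assumes an approximate human genome size of 3.1 Gb and ignores duplicates, unmapped reads, and adapter-trimmed bases, so it gives a lower bound on what needs to be sequenced; the function names are illustrative.

```python
def bases_required(genome_size_bp: float, coverage: float) -> float:
    """Total sequenced bases needed for the requested mean coverage."""
    return genome_size_bp * coverage


def read_pairs_required(genome_size_bp: float, coverage: float, read_length: int) -> float:
    """Read pairs needed when each pair contributes 2 * read_length bases."""
    return bases_required(genome_size_bp, coverage) / (2 * read_length)


if __name__ == "__main__":
    HUMAN_GENOME_BP = 3.1e9  # approximate human genome size (assumption)
    total = bases_required(HUMAN_GENOME_BP, 30)
    pairs = read_pairs_required(HUMAN_GENOME_BP, 30, 150)
    print(f"{total / 1e9:.0f} Gb, {pairs / 1e6:.0f} million read pairs at 2 x 150 bp")
```

With these inputs the estimate is about 93 Gb, or roughly 310 million read pairs at 2 × 150 bp, which sits inside the 90-100 Gb range quoted above.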
How read length affects coverage and data quality: The choice between 2 × 150 bp, 2 × 250 bp, and 2 × 300 bp sequencing is not just about getting longer reads. Longer reads provide more mappable sequence per fragment, improving alignment accuracy in repetitive regions and enabling more robust isoform detection in RNA-seq. However, as discussed in the phasing section, longer reads come with lower quality in later cycles and reduced total read counts per flow cell (because each cycle takes time, and longer runs can process fewer total reads per unit time). For most human WGS applications, 2 × 150 bp is the standard because it provides sufficient read length for accurate alignment while maintaining high throughput and quality. For amplicon-based applications requiring full coverage of variable regions (e.g., 16S rRNA V3-V4), 2 × 300 bp on MiSeq is often necessary despite the lower per-read quality in the final cycles. Amplicon sequencing services are optimized for these applications with appropriate read length configurations.
Data volume planning: For a project requiring 30× human WGS on 96 samples, the total expected FASTQ volume is approximately 19 TB uncompressed (96 samples × roughly 200 GB each); gzip-compressed FASTQ is typically several times smaller. Storage, transfer, and computational analysis capacity must be planned accordingly. Cloud-based analysis platforms that charge by the gigabyte are often more cost-effective for large projects than on-premises infrastructure. Professional bioinformatics analysis services can handle large-scale data processing and interpretation, bridging the gap between raw sequencing output and biological insight.
FAQ
What causes phasing in Illumina sequencing?
Phasing occurs when some templates in a cluster fail to incorporate a nucleotide during a given SBS cycle, causing them to lag behind the majority of molecules in the cluster. The effect accumulates over successive cycles and is the primary reason quality scores decline toward the end of a read.
What is the difference between patterned and non-patterned flow cells?
Patterned flow cells have pre-etched nanowells at fixed positions, each capable of supporting one cluster. Non-patterned flow cells have randomly distributed capture oligonucleotides. Patterned flow cells eliminate cluster overlap and enable higher usable cluster densities.
Why do Q-scores decline toward the end of a read?
The decline is primarily caused by phasing and pre-phasing effects that accumulate over successive SBS cycles. After 150-300 cycles, 10-20% of molecules in each cluster may be out of phase, degrading the accuracy of the consensus base call.
How does 2-channel imaging differ from 4-channel imaging?
2-channel imaging uses a combinatorial logic system where two dyes and their on/off states encode four bases. It is faster than 4-channel imaging but more sensitive to low-diversity libraries. 4-channel imaging assigns a distinct fluorophore to each base, providing higher spectral discrimination.
Why do low-diversity libraries need PhiX spike-in?
Low-diversity libraries do not provide the balanced base composition needed for the instrument to calibrate signal intensities correctly. PhiX control library provides a diverse reference that enables accurate signal normalization across all four bases.
What is the ideal cluster density on a patterned flow cell?
The optimal range is typically 250-350 K clusters per mm². Below this range, data output is lower than the flow cell’s capacity; above it, cluster resolution decreases and Q-scores drop.
How does XLEAP-SBS chemistry improve sequencing speed?
XLEAP-SBS uses a redesigned polymerase with faster nucleotide incorporation kinetics, reducing each SBS cycle by 30-50%. Combined with a faster imaging system, this reduces 2 × 150 bp run time from ~40 hours to ~24 hours on the NovaSeq X platform.
What causes GC bias in Illumina sequencing data?
GC-rich regions form stable secondary structures that reduce polymerase processivity during both library amplification and cluster generation. This results in lower sequencing coverage in these regions. PCR-free library preparation reduces but does not eliminate this effect.
How are base calls converted from fluorescence signals?
The instrument’s base-calling software compares the fluorescence intensity at each cluster position across the imaging channels, applies phasing correction, and assigns the most likely base with a quality score. The output is written as BCL files, which are later converted to FASTQ format.
What single factor causes the most sequencing failures?
Inaccurate library quantification is the single most common cause of sequencing failures. Overestimated concentration leads to over-clustering (cluster overlap and signal confusion); underestimation leads to under-clustering (wasted flow cell capacity). Using qPCR-based quantification, which measures only amplifiable library molecules, provides the most reliable loading concentration.
How do I distinguish between phasing-related quality loss and contamination-related quality loss?
Phasing produces a characteristic downward-sloping Q-score trend—high quality in early cycles, progressively lower in later cycles. Contamination or carryover produces uniformly low quality across all cycles—if cycle 1 is already at Q20, the problem is not phasing but rather a sample or reagent chemistry issue.
What is the difference between Q30 and Q40?
Q30 represents 99.9% base call accuracy (1 error per 1,000 bases). Q40 represents 99.99% accuracy (1 error per 10,000 bases). The higher Q-score threshold is used for applications requiring extremely high accuracy, such as rare variant detection or clinical sequencing.