DNA sequencing is a critical tool in biological research. Traditional Sanger sequencing, first introduced in 1977, was the most widely used method for more than three decades.1 It is based on the use of radioactively or fluorescently labeled chain-terminating nucleotides (dideoxynucleotides).1-3 However, Sanger sequencing is restricted to the discovery of substitutions and small insertions and deletions (indels); thus, new technologies emerged that enable faster and more accurate sequencing in massively parallel reactions.4
Next Generation Sequencing (NGS), massively parallel sequencing and deep sequencing are related terms describing a DNA sequencing technology that has revolutionized genomic research.2-5 Most so-called ‘massively parallel’ or ‘next-generation’ sequencing methods build on the fundamental enzymology of Sanger sequencing, but apply it in an orchestrated and stepwise fashion, enabling sequence data to be generated from tens of thousands to billions of templates simultaneously.6
There are three main types of NGS approaches that can be used for the identification of genomic mutations: whole-genome sequencing (WGS), whole-exome sequencing (WES) and targeted-deep sequencing (TDS).7 WGS provides data on the entire human genome, but as approximately 98% of the genome is non-coding and little is known about the involvement of non-coding DNA in the pathogenesis of myelodysplastic syndromes (MDS), this method is rarely used clinically in MDS patients.8 WES has played an important role in the study of myeloid neoplasms, particularly in unveiling clonal evolution in MDS and acute myeloid leukemia (AML).9,10 Although WES and WGS may be considered a priori the best options because of the large number of assessed regions, they have limitations that hinder their incorporation into routine clinical practice, including high sequencing costs, time-consuming bioinformatics analysis and laborious clinical interpretation of the sequencing results.11-13 In addition, WGS and WES can miss clinically relevant subclonal mutations with a low variant allele frequency (VAF), such as TP53 mutations.14 For these reasons, TDS, which sequences only target regions of interest, is a powerful approach that offers the best balance between accurate, highly sensitive identification of targeted events and the overall cost and data burden of large-scale applications. Additionally, the greater read depths of TDS allow detection of variants at lower VAFs, a key consideration in MDS, which displays very high intratumoral heterogeneity and in which the vast majority of known pathogenic variants are found in a relatively small number of genes.3,15,16
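To illustrate why read depth matters for low-VAF variants, the following minimal sketch (in Python; the depths, thresholds and the binomial sampling assumption are illustrative and not taken from the cited studies) computes the probability of sampling enough variant-supporting reads to detect a 2% VAF mutation at WES-like versus TDS-like coverage.

```python
# Minimal sketch (illustrative only): why deeper coverage is needed to
# detect low-VAF variants. All numbers below are hypothetical examples.
from math import comb

def vaf(alt_reads: int, total_reads: int) -> float:
    """Variant allele frequency = variant-supporting reads / total reads."""
    return alt_reads / total_reads

def detection_probability(true_vaf: float, depth: int, min_alt_reads: int) -> float:
    """Probability of sampling at least `min_alt_reads` variant reads at a
    locus covered `depth` times, assuming binomial sampling of alleles."""
    p_fewer = sum(
        comb(depth, k) * true_vaf**k * (1 - true_vaf)**(depth - k)
        for k in range(min_alt_reads)
    )
    return 1 - p_fewer

# A subclonal mutation at 2% VAF is routinely missed at WES-like depth but
# reliably sampled at TDS-like depth (caller thresholds aside).
for depth in (100, 1000):  # ~WES-like vs ~TDS-like coverage
    print(f"depth {depth:>4}: P(>=5 alt reads) = "
          f"{detection_probability(0.02, depth, 5):.3f}")
```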
TDS comes in two main forms: amplicon-based or capture-based. Amplicon-based enrichment uses specifically designed primers to amplify only the regions of interest prior to library preparation. In contrast, in capture-based approaches the DNA is fragmented and the targeted regions are enriched via hybridization to biotinylated oligonucleotide bait sequences, allowing their isolation from the remaining genetic material.17-19 Amplicon-based enrichment is the cheaper of the two technologies, yields a greater proportion of on-target reads and requires much less starting material than capture, making it ideal when little DNA is available for TDS. However, coverage of the target regions is more uniform with hybridization capture.7 Moreover, capture has been shown to produce fewer PCR duplicates than amplicon enrichment (<40% vs. ~80%, respectively), and these duplicates are easier to remove computationally because the random shearing of the DNA in capture platforms reduces the likelihood of two unique fragments aligning to the same genomic coordinates. Similarly, the long bait sequences used in capture allow the sequencing of difficult regions, such as repetitive sequences, and a greater level of specificity in region selection.7 Therefore, capture-based techniques provide more accurate and uniform target selection, while amplicon-based methodologies are often used in small-scale experiments where sample quantity or cost is a limiting factor.
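The coordinate-based reasoning above can be made concrete with a small sketch (hypothetical reads and coordinates): duplicates are flagged when two reads share identical alignment coordinates, a criterion that works for randomly sheared capture libraries but, by design, cannot distinguish duplicates in amplicon data, where all reads share the primer-defined coordinates.

```python
# Minimal sketch (hypothetical reads): coordinate-based duplicate marking,
# the approach used by tools such as Picard MarkDuplicates. With randomly
# sheared (capture) libraries, two independent molecules rarely share
# identical alignment coordinates, so identical coordinates imply a PCR
# duplicate. With amplicons, every read shares the primer-defined
# coordinates, so this criterion cannot separate duplicates from unique
# molecules.

# (chromosome, start, end) alignment coordinates for each read
reads = [
    ("chr17", 7578190, 7578340),  # unique molecule
    ("chr17", 7578190, 7578340),  # PCR duplicate of the read above
    ("chr17", 7578203, 7578351),  # unique molecule (different shear points)
]

def mark_duplicates(reads):
    """Keep the first read seen at each coordinate pair; flag the rest."""
    seen = set()
    flags = []
    for coords in reads:
        flags.append(coords in seen)  # True = duplicate
        seen.add(coords)
    return flags

for read, is_dup in zip(reads, mark_duplicates(reads)):
    print(read, "DUPLICATE" if is_dup else "kept")
```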
A summary of the main types of NGS and of the two TDS approaches that can be used for the identification of genomic mutations is shown in Figure 1.7
Figure 1. The main types of next generation sequencing of DNA (adapted from Bewicke-Copley, Comput Struct Biotechnol J, 2019).7 Whole genome, whole exome and targeted deep sequencing are shown, as well as the two targeted deep sequencing approaches (amplicon and capture-based enrichment) that can be used for the identification of genomic mutations.
All approaches involve a series of general steps that are common to most NGS platforms: 1) sample preparation and choice of DNA sequencing strategy; 2) library preparation; 3) sequencing; and 4) data analysis.
Sample preparation and choice of DNA sequencing strategy
The first step in NGS is the extraction of the genetic material, DNA or RNA. Although this step may seem trivial, the quantity and quality of the input material are crucial because the generation of micrograms of DNA from the input material by PCR is a sensitive process subject to many variations.
Ideally, an NGS library would perfectly represent the DNA present in the biological specimen from which it was derived; it cannot create more information than was present in the original template. NGS libraries can be made from small amounts of DNA, but reducing the input may compromise assay sensitivity owing to the loss of sequence heterogeneity caused by overamplification of particular DNA molecules during library construction (duplicate reads). These duplicate reads can cause failure to detect sequence variants that were present in the original specimen, over- or under-representation of particular variants, and false-positive variant calls resulting from PCR errors that are propagated through library preparation and sequencing.20,21
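As a rough illustration of this effect, the following sketch uses a simple Lander-Waterman-style model (an assumption introduced here, not taken from the cited references) to show how the expected duplicate-read fraction grows as the number of unique input molecules shrinks relative to the number of reads sequenced.

```python
# Minimal sketch (illustrative numbers): how limited input inflates the
# duplicate-read fraction. If a library contains C unique molecules and we
# sequence R reads sampled uniformly with replacement, the expected number
# of distinct molecules observed is C * (1 - exp(-R / C)); all remaining
# reads are duplicates of molecules already seen.
from math import exp

def expected_duplicate_fraction(unique_molecules: int, reads: int) -> float:
    distinct_observed = unique_molecules * (1 - exp(-reads / unique_molecules))
    return 1 - distinct_observed / reads

reads = 1_000_000
for unique_molecules in (10_000_000, 1_000_000, 200_000):  # shrinking input
    frac = expected_duplicate_fraction(unique_molecules, reads)
    print(f"{unique_molecules:>10} unique molecules -> "
          f"~{frac:.0%} duplicate reads")
```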
Likewise, the choice of DNA sequencing strategy is one of the first decisions to be made. As described previously, the various NGS approaches differ in total sequence capacity, read length, run time and the quality and accuracy of the data. These characteristics should therefore be weighed in order to choose the best strategy for the specific clinical application.13
Library preparation
Library preparation consists of the fragmentation of genomic DNA into a random set of short DNA fragments of a particular size range, usually 100 to 500 base pairs, flanked by platform-specific adapters at both ends.
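A toy sketch of the resulting library structure may help visualize what a library molecule looks like after fragmentation, size selection and adapter ligation (all sequences below, including the truncated adapter strings, are hypothetical placeholders):

```python
# Minimal toy sketch (hypothetical sequences): random shearing of a genome,
# size selection of 100-500 bp inserts, and adapter ligation at both ends.
import random

random.seed(0)
ADAPTER_P5, ADAPTER_P7 = "AATGATACGGCG", "CAAGCAGAAGAC"  # truncated examples

genome = "".join(random.choice("ACGT") for _ in range(10_000))

def make_library(genome: str, n_fragments: int, size_range=(100, 500)):
    """Randomly shear the genome, size-select, and ligate adapters."""
    library = []
    while len(library) < n_fragments:
        start = random.randrange(len(genome))
        length = random.randint(*size_range)
        insert = genome[start:start + length]
        if len(insert) < size_range[0]:  # discard truncated end fragments
            continue
        library.append(ADAPTER_P5 + insert + ADAPTER_P7)
    return library

lib = make_library(genome, n_fragments=3)
print([len(mol) for mol in lib])  # adapter + 100-500 bp insert + adapter
```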
As detailed previously, two main approaches are used for target enrichment during library preparation: capture-based and amplicon-based methods. In capture-based assays, library preparation begins with random shearing of the genomic DNA, by mechanical or enzymatic fragmentation, followed by capture of the regions of interest using biotinylated oligonucleotide probes complementary to these sequences; the targeted fragments are then isolated with streptavidin-coated magnetic beads. In contrast, amplicon-based methods involve PCR amplification of the selected genomic regions using primers, so the sequencing reads generated have the same start and stop coordinates, dictated by the primer design.3,6,13
In both cases, platform-specific adapters are attached at each end of the template fragments. These adapter sequences are complementary to oligonucleotides bound to beads or to a solid surface called a ‘flow cell’, so that the libraries can undergo bridge hybridization on the sequencing platform.3 The sequencing templates are thereby immobilized on a proprietary flow cell surface designed to present the DNA in a manner that facilitates enzyme access while ensuring high stability of the surface-bound template and low non-specific binding of fluorescently labeled nucleotides.
Sequencing
Most commercial platforms using massively parallel sequencing are based on the concept of sequencing by synthesis. In essence, these methods allow nucleotide incorporation using a variety of enzymes and detection schemes that permit the corresponding instrument platform to collect data in lockstep with enzymatic synthesis on a template.6
Sequencing by synthesis consists of the amplification of DNA fragments in parallel reactions using bridge hybridization and PCR in the presence of fluorescently labeled nucleotides. In detail, after hybridization of the amplified libraries on the flow cell, each library fragment acts as a template from which a new complementary strand is synthesized by PCR, and the template strand is then removed. As the complementary strands undergo bridge hybridization with neighboring adapter-binding sites, repeated PCR cycles lead to cluster amplification.3,6,22 Finally, the sequencer detects the fluorescence generated by the incorporation of fluorescently labeled nucleotides into the growing DNA strand.22,23 During each sequencing cycle, a single labeled nucleotide is added to the nucleic acid chain and serves as a reversible terminator for polymerization; after each incorporation, the fluorescence is imaged to identify the base, and the dye and terminator are then enzymatically cleaved to allow incorporation of the next nucleotide. The result is highly accurate base-by-base sequencing that minimizes sequence-context-specific errors, enabling robust base calling across the genome, including repetitive regions and homopolymers.
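The cycle logic described above can be sketched as follows (a deliberately simplified toy model: one template, no clusters, no noise; the complement lookup stands in for base-specific fluorophore detection):

```python
# Minimal toy sketch: the cycle logic of sequencing by synthesis. Each cycle
# incorporates exactly one labeled, reversibly terminating nucleotide per
# template, "images" the base complementary to the template, then removes
# the terminator so the next cycle can proceed. Real instruments image
# millions of clusters in parallel; error and noise modeling is omitted.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def sequence_by_synthesis(template: str, n_cycles: int) -> str:
    read = []
    for position in range(min(n_cycles, len(template))):
        # One cycle: incorporate the complementary labeled nucleotide...
        incorporated = COMPLEMENT[template[position]]
        # ...image its fluorophore to call the base...
        read.append(incorporated)
        # ...then cleave dye and terminator (implicit: the loop continues).
    return "".join(read)

print(sequence_by_synthesis("ATGCGTTAGC", n_cycles=8))  # -> TACGCAAT
```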
Nevertheless, sequencing by synthesis has some limitations. For example, the achievable read length is ultimately limited by the increasing noise over successive incorporation and imaging cycles, and read lengths remain shorter than those of Sanger sequencing. These limitations should be considered because they affect how the sequence data are analyzed.6
The main steps of library preparation, bridge hybridization, cluster amplification and sequencing by synthesis are shown in Figure 2.3
Figure 2. The steps of next generation sequencing (adapted from Spaulding, Br J Haemat, 2020).3 The library preparation, bridge hybridization, cluster amplification and sequencing by synthesis steps are shown.
Data analysis
NGS generates large amounts of data that require substantial computational infrastructure for storage, analysis and interpretation. NGS data analysis can therefore be an extremely complex, time-consuming and laborious process.
The workflow of NGS data analysis includes four phases: 1) base calling, assignment of base quality scores and de-multiplexing; 2) alignment of reads to a reference assembly or sequence(s) and variant calling; 3) annotation of the variants; and 4) interpretation of clinically relevant variants.24 There is now a large number of bioinformatics tools and software packages for analyzing NGS data. Most are algorithms specialized in one phase of the analysis, but there are also multiple pipelines that chain several algorithms to analyze the data serially.
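As a minimal illustration of how such a pipeline chains these phases, the sketch below strings together commonly used open-source tools (BWA-MEM, samtools, bcftools and Ensembl VEP are given as typical examples, not as a recommended or validated clinical pipeline; file names are hypothetical, and single-end reads are assumed for brevity):

```python
# Minimal orchestration sketch (hypothetical file names; tool invocations
# are typical examples). It chains phases 2-3 of the workflow above; phase 1
# happens on the instrument and phase 4 is human interpretation.
import subprocess

def run(cmd: str) -> None:
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

reference, fastq = "GRCh38.fa", "sample_R1.fastq.gz"

# Phase 2a: align reads to the reference assembly (BWA-MEM), then sort
run(f"bwa mem {reference} {fastq} | samtools sort -o sample.bam -")
run("samtools index sample.bam")
# Phase 2b: call variants against the reference (bcftools as one example)
run(f"bcftools mpileup -f {reference} sample.bam | bcftools call -mv -o sample.vcf")
# Phase 3: annotate variants (e.g., Ensembl VEP) for downstream interpretation
run("vep -i sample.vcf -o sample.annotated.vcf --cache")
```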
The first phase of data analysis occurs on the sequencing platform, where the sequence reads are produced. It consists of: i) base calling, the identification of the specific nucleotide present at each position in a single sequencing read; ii) assignment of base-calling accuracy, measured by the Phred quality score (Q score), which indicates the probability that a given base has been called incorrectly by the sequencer; and iii) de-multiplexing, the computational assignment of reads to a patient when multiple samples have been multiplexed (pooled) before sequencing.24,25
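The Q score has a simple closed form, Q = -10 log10(P_error), and is stored per base in FASTQ files as an ASCII character offset by 33 (Phred+33 encoding); a minimal sketch, with a hypothetical quality string:

```python
# Minimal sketch: the Phred quality score used in base calling.
# Q = -10 * log10(P_error); FASTQ files store one Q score per base as an
# ASCII character offset by 33. The quality string below is hypothetical.
from math import log10

def phred(p_error: float) -> float:
    return -10 * log10(p_error)

print(phred(0.001))   # Q30: 1 error expected in 1,000 base calls
print(phred(0.0001))  # Q40: 1 error expected in 10,000 base calls

fastq_quality = "IIIIIHHGF#"           # last base is very low quality
q_scores = [ord(c) - 33 for c in fastq_quality]
print(q_scores)                        # 'I' -> Q40, '#' -> Q2
```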
The second phase of NGS data analysis is the mapping and alignment of each read to the reference human genome assembly, which is available from the Genome Reference Consortium (the latest versions can be found at http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/). The alignment tools used by clinical laboratories implement multiple algorithms that sometimes trade speed for accuracy and can incorporate different levels of sensitivity and specificity. It is recommended that alignment be followed by additional processing for local realignment, removal of PCR duplicates and recalibration of base-call quality scores. After alignment, the next step is variant calling, the process of comparing the aligned reads to the reference genome to identify base-pair variations. A variety of software tools and strategies are available for variant calling, but no single tool or setting is currently able to identify all variant classes with equal accuracy. Several variant callers and/or parameter settings should therefore be evaluated during assay development to optimize the detection of the different variant types.7,24,26
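To make the idea of variant calling concrete, the following toy sketch (hypothetical pileup; real callers additionally model base and mapping qualities, strand bias and sequencing error) reports the non-reference allele and its VAF at a single position:

```python
# Minimal toy sketch (hypothetical pileup): the core idea of variant
# calling, comparing aligned read bases at one position with the reference
# base and reporting the non-reference allele together with its VAF.
from collections import Counter

def call_position(ref_base: str, pileup: str, min_alt_reads: int = 4,
                  min_vaf: float = 0.02):
    counts = Counter(pileup.upper())
    depth = sum(counts.values())
    for base, n in counts.most_common():
        if base != ref_base and n >= min_alt_reads and n / depth >= min_vaf:
            return {"ref": ref_base, "alt": base, "depth": depth,
                    "vaf": round(n / depth, 3)}
    return None  # no variant called at this position

# 100x toy pileup with 8 reads supporting a C>T change (VAF 8%)
pileup = "C" * 92 + "T" * 8
print(call_position("C", pileup))
```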
Following variant calling, the next phase is to annotate the variants in relation to genes (within or outside a gene), codon and amino acid positions, and to classify the types of variants, such as nonsense, missense, exonic deletions and synonymous variants. This allows a greater understanding of their functional consequences for the genes to which they relate. In many studies, only non-silent exonic or splicing mutations are selected for further analysis, focusing exclusively on functional coding variants. However, these criteria may change depending on the purpose of the study; for example, variants in UTRs or in promoter regions may also be of interest.7
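A minimal sketch of this classification step (using a small subset of the standard genetic code; the example codons are arbitrary):

```python
# Minimal toy sketch (partial codon table; example codons are hypothetical):
# classifying a coding variant by its effect on the encoded amino acid.
CODON_TABLE = {  # subset of the standard genetic code, enough for the demo
    "TGG": "Trp", "TGA": "*", "GAG": "Glu", "GAT": "Asp", "GAC": "Asp",
}

def classify(ref_codon: str, alt_codon: str) -> str:
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"

print(classify("GAT", "GAC"))  # Asp -> Asp : synonymous
print(classify("GAG", "GAT"))  # Glu -> Asp : missense
print(classify("TGG", "TGA"))  # Trp -> stop: nonsense
```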
The final phase of NGS data analysis is a clinical assessment to determine which variants should be reported and how they should be described in the laboratory report. Some algorithms optimize settings to predict deleterious variants located in clinically relevant genes and to filter out selected variants, such as those with a high population allele frequency. Similarly, existing databases, such as ClinVar, COSMIC, ClinGen, protein-impact predictors and the Human Variome Project, are valuable tools for assessing the relevance of variants and for categorizing them according to predicted functional impact (‘benign’, ‘likely benign’, ‘uncertain significance’, ‘likely pathogenic’ or ‘pathogenic’). Nonetheless, the interpretation of variants remains a major challenge because there is no standard format or common understanding of variant data between laboratories. Therefore, collaborations and ongoing discussions among laboratory researchers, clinicians, manufacturers, bioinformaticians, software developers and professional organizations are needed to ensure the quality of NGS tests and to guide clinical decision-making in myeloid malignancies.24,26
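The population-frequency filtering step can be sketched as follows (all variants, frequencies and the cutoff below are hypothetical illustrations, not clinical recommendations):

```python
# Minimal sketch (all values hypothetical): filtering out variants that are
# common in the general population and keeping rarer candidates for manual
# classification and reporting.
variants = [
    {"gene": "TP53",  "change": "p.R248Q",  "population_af": 0.00001},
    {"gene": "TET2",  "change": "p.I1762V", "population_af": 0.21},  # common
    {"gene": "SF3B1", "change": "p.K700E",  "population_af": 0.00002},
]

MAX_POPULATION_AF = 0.01  # variants above this are presumed non-pathogenic

reportable = [v for v in variants if v["population_af"] <= MAX_POPULATION_AF]
for v in reportable:
    print(f'{v["gene"]} {v["change"]}: retained for clinical assessment')
```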
Despite its potential to inform and guide clinical practice, NGS has some limitations. Although the costs of NGS have become more affordable, they remain high. In addition, it can take one to three weeks to receive a finalized data report, which may be too long to affect an initial treatment decision for a patient with more aggressive disease. Regarding practical considerations, NGS workflows are not standardized and there is still a lack of gold-standard analysis pipelines, which can lead to poor reproducibility between laboratories, even for the same data sets. There is thus a need for collaboration to form larger data pools in order to improve the accuracy and efficacy of variant calling for prognostic and therapeutic signatures and to draw meaningful conclusions.3,7 Moreover, although the ability of NGS to detect variants down to 1% VAF is an improvement over Sanger sequencing (detection threshold >20% VAF), interpreting VAF in a meaningful way requires context. Not all somatic gene mutations detected by NGS are pathogenic mutations contributing to the development of myeloid malignancy, and further studies are needed to establish a meaningful threshold for VAF sensitivity.3
In summary, NGS has revolutionized genomic research and is one of the most promising methodologies for changing patient management. However, it will be important to standardize sequencing and variant calling methods and to continue expanding knowledge of the pathogenic and prognostic roles of specific variants in order to improve the clinical utility of NGS.
Marta Martín Izquierdo, María Alvarado Abáigar, Rocío Benito Sánchez, Jesús María Hernández Rivas. Next generation sequencing (NGS) in the MDS study. Atlas of Genetics and Cytogenetics in Oncology and Haematology, 2022-03-04.
Online version: http://atlasgeneticsoncology.org/teaching/208929/next-generation-sequencing-(ngs)-in-the-mds-study