Digital transcriptome subtraction

Source: Wikipedia, the free encyclopedia.
Fig 1. Digital Transcriptome Subtraction

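Digital transcriptome subtraction (DTS) is a bioinformatics method for detecting pathogen transcripts by computationally subtracting host sequences from the transcripts sequenced from a tissue sample. It is applied to infectious diseases and is best known for discovering Merkel cell polyomavirus, the suspected causative agent in Merkel-cell carcinoma.[1]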

History

Using computational subtraction to discover novel pathogens was first proposed in 2002 by Meyerson et al.

In a proof-of-principle experiment, Meyerson et al. demonstrated that it was a feasible approach using Epstein–Barr virus-infected lymphocytes in post-transplant lymphoproliferative disorder (PTLD).[3]

In 2007, the term "Digital Transcriptome Subtraction" was coined by the Chang-Moore group,[4] and the method was used to discover Merkel cell polyomavirus in Merkel-cell carcinoma.[1]

Around the same time as the discovery of Merkel cell polyomavirus (MCV), this approach was used to implicate a novel arenavirus as the cause of death in a case where three patients died of similar illnesses shortly after receiving organ transplants from a single donor.[5]

Method

Fig. 2. Raw transcript breakdown from sequencing 20,000 clones derived from virus-infected human tissues. Viral transcripts were present at 0.03% of the total sequence reads.[3]

Construction of cDNA library

After treatment with

E. coli are then transformed using the cDNA vectors and selected using a marker; the collection of transformed clones is the cDNA library. This generates a snapshot of tissue mRNA that is stable and can be sequenced at a later stage.

Sequencing and quality control

The cDNA library must be sequenced to great depth (i.e. a large number of clones must be sequenced) in order to detect a theoretically rare pathogen sequence (Table 1), especially if the foreign sequence is novel. The Chang-Moore group recommends a sequencing depth of 200,000 transcripts or greater using multiple sequencing platforms.[1]

Table 1. Probability of capturing at least one viral transcript in human tissue-derived libraries.[2]
% Viral   5,000 clones   10,000 clones   20,000 clones   50,000 clones
0.001%    4.9%           9.5%            18.1%           39.3%
0.01%     39.3%          63.2%           86.5%           99.3%
0.02%     63.2%          86.5%           98.2%           >99.995%
0.03%     77.7%          95.5%           99.8%           >99.995%
0.04%     86.5%          98.2%           >99.995%        >99.995%
0.1%      99.3%          >99.995%        >99.995%        >99.995%
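
The relationship between sequencing depth and detection probability in Table 1 can be sketched with a simple model. This is a minimal illustration, assuming each sequenced clone is an independent draw and that a fixed fraction p of transcripts is viral, so that the chance of seeing at least one viral transcript among n clones is 1 - (1 - p)^n. The function name is illustrative and does not come from the cited work.

```python
def prob_at_least_one(viral_fraction: float, n_clones: int) -> float:
    """Probability that at least one of n_clones is viral.

    Assumes independent draws with a fixed viral fraction
    (a modelling assumption, not a statement from the source).
    """
    return 1.0 - (1.0 - viral_fraction) ** n_clones


if __name__ == "__main__":
    depths = [5_000, 10_000, 20_000, 50_000]
    for percent in [0.001, 0.01, 0.02, 0.03, 0.04, 0.1]:
        fraction = percent / 100.0  # convert percent to a fraction
        row = [f"{100 * prob_at_least_one(fraction, n):.1f}%" for n in depths]
        print(f"{percent}%", *row)
```

Under this model, a viral fraction of 0.001% at 5,000 clones gives about 4.9%, matching the first entry of the table. The model also shows why depth matters: doubling the number of clones has the same effect as doubling the viral fraction.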

Stringent quality control is then applied to the raw sequences to minimize false-positive results. The initial quality screen uses several general parameters to exclude ambiguous sequences, leaving behind a dataset of high-fidelity (Hi-Fi) reads.

  • Low Phred score - a cutoff is used to remove low-quality end sequences. Typically, a Phred score cutoff of 20 or 30 is used to ensure 99%-99.9% accuracy in each base call.
  • Vector and adaptor removal.
  • Low complexity - the complexity score of a sequence reflects the number of identical bases in a series (homopolymers), such as poly-dT or poly-dA.
  • Human repetitive DNA.
  • Length - this parameter depends on the optimized read length specific to the sequencing technology that was used.
  • E. coli genome sequences.
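
Some of the screens listed above can be sketched in a few lines. The following is a minimal illustration, not the pipeline used in the cited studies: it covers only end trimming by Phred score, a homopolymer check as a crude stand-in for a complexity score, and a length filter. All function names and the default thresholds (`min_length`, `max_run`) are hypothetical. Vector, adaptor, repeat and E. coli screening are alignment-based and are not shown.

```python
def error_probability(phred):
    """Base-call error probability for a Phred score Q: 10^(-Q/10)."""
    return 10 ** (-phred / 10)


def trim_low_quality_ends(seq, quals, cutoff=20):
    """Trim bases from both ends until a base with quality >= cutoff is reached."""
    start, end = 0, len(seq)
    while start < end and quals[start] < cutoff:
        start += 1
    while end > start and quals[end - 1] < cutoff:
        end -= 1
    return seq[start:end], quals[start:end]


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    longest = run = 0
    previous = ""
    for base in seq:
        run = run + 1 if base == previous else 1
        previous = base
        longest = max(longest, run)
    return longest


def passes_quality_screen(seq, quals, cutoff=20, min_length=50, max_run=15):
    """Return the trimmed read if it passes the screens, else None.

    min_length and max_run are hypothetical thresholds chosen for illustration.
    """
    trimmed, _ = trim_low_quality_ends(seq, quals, cutoff)
    if len(trimmed) < min_length:
        return None
    if longest_homopolymer(trimmed) > max_run:
        return None
    return trimmed
```

A Phred score of 20 corresponds to an error probability of 0.01 (99% accuracy) and a score of 30 to 0.001 (99.9% accuracy), which is where the accuracy range quoted in the list comes from.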

BLAST to host genome

Using MEGABLAST, Hi-Fi reads are then matched to sequences in annotated databases, and any positive matches are subtracted from the dataset. The minimum hit length for a positive match to human sequence is typically 30 consecutive identical bases, which equates to a BLAST score of 60. The remaining sequences are generally run through BLAST again with less stringent parameters to allow for slight mismatches (1 in 20 nucleotides). The vast majority of sequences (>99%) should be removed from the dataset at this stage.

Subtracted sequences typically include:

  • Reference human transcriptome - eliminates any known human transcripts from expression library sets.
  • Reference human genome - eliminates genes that have been missed by the annotation process and any genomic sequences that contaminated the cDNA library during its construction.
  • Mitochondrial DNA - mitochondrial DNA is highly abundant and polymorphic due to its rapid mutation rate.
  • Immunoglobulin region - the immunoglobulin loci are highly polymorphic and would otherwise yield false positives due to poor alignment to the reference genome.
  • Other vertebrate sequences
  • Unannotated sequences
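
The subtraction step can be illustrated without a BLAST installation. The sketch below is a simplified stand-in for MEGABLAST, not a reimplementation of it: it treats a read as a host match if it shares at least one exact 30-base substring with any host sequence on either strand, which mirrors the "30 consecutive identical bases" criterion but none of BLAST's scoring or its tolerance for mismatches. All names are hypothetical, and holding every host k-mer in memory would not scale to a full human genome.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_host_index(host_sequences, k=30):
    """Collect all k-mers from both strands of the host sequences."""
    index = set()
    for seq in host_sequences:
        seq = seq.upper()
        index.update(kmers(seq, k))
        index.update(kmers(reverse_complement(seq), k))
    return index


def subtract_host(reads, host_index, k=30):
    """Keep only reads that share no exact k-mer with the host index.

    reads maps a read name to its sequence.
    """
    return {
        name: seq
        for name, seq in reads.items()
        if not any(kmer in host_index for kmer in kmers(seq.upper(), k))
    }
```

In use, `build_host_index` would be called once on the reference sequences and `subtract_host` on each batch of Hi-Fi reads; the reads that survive are the "non-host" candidates.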

Analysis of "non-host" candidates

Alignment to pathogen databases

After stringent rounds of subtraction, the remaining sequences are clustered into non-redundant contigs and aligned to known pathogen sequences using low-stringency parameters. As pathogen genomes mutate quickly, nucleotide-nucleotide alignments are often uninformative. Matching the translated open reading frames to the amino acid sequences of annotated proteins, or blastx, is the preferred alignment method, as it increases the likelihood of identifying a novel pathogen by matching to a related strain or species.[5] Experimental extension of candidate sequences might also be used at this stage to maximize the chances of a positive match.[6]
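
A blastx search translates each nucleotide query in all six reading frames before comparing it with a protein database. The translation step can be sketched as follows. This is a minimal illustration using the standard genetic code; the function names are hypothetical, and the protein comparison itself is not shown.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unrecognized codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return the translations of all six reading frames, keyed '+1'..'-3'."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

Because several codons encode the same amino acid, two related viruses can differ at many nucleotide positions while their translated sequences stay similar, which is why protein-level matching tolerates divergence better than nucleotide-level matching.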

De novo assembly

In cases where alignment to known pathogens is uninformative or ambiguous, contigs of candidate sequence can be used as templates for

low coverage
.

Validation of pathogen

Once a putative pathogen has been identified in the high-throughput sequencing data, it is imperative to validate the presence of the pathogen in infected patients using more sensitive techniques, such as:

Applications

The primary application for DTS lies in identification of pathogenic viruses in cancer.[1][4] It can also be used to identify viral pathogens in non-cancer related disease.[5] Future clinical applications could include the use of DTS on a routine basis in individuals. DTS could also apply to

honey bees.[7]

Advantages

Disadvantages

References