Reference Guided Transcriptome Assembly

Many RNASeq studies involve organisms that are well studied, or that at least have a genome sequence to work with. If you have reasonable confidence in that genome sequence, you can assume it is the source of the transcript sequences you want to study and assemble a transcriptome using the genome sequence as a guide. One of the most widely used sets of RNASeq tools, the Tuxedo suite (Tophat, Cufflinks, etc.), takes this approach. Several other applications also work with the genome to build a transcriptome sequence set.

Reference Guided Assembly

The Tuxedo suite. This well-polished set of tools supports multiple approaches to studying gene expression, including using only existing annotation to determine which genes are expressed in an RNASeq sample. A very common use of the suite is to assemble and identify the transcripts, then count the hits (mapped reads) for each transcript and for the gene it represents. To do this, the user first maps the reads (already cleaned up) to the genome with Tophat2, then uses Cufflinks2 to assemble the overlapping and gapped alignments in the BAM files into exons and transcripts. Cufflinks2 outputs a .gtf file that lays out the transcripts relative to the genome. Cuffmerge2 can merge these .gtf files into a single consolidated file and match it up with the existing genome annotation. CuffDiff2 can then be used to determine differential expression between conditions, each represented by a group of files. Tophat2 and Cufflinks2 do the main transcriptome assembly, but Cuffmerge2 also contributes to the final arrangement of the output sequences. More info: ccb.jhu.edu/software/tophat/manual.shtml, http://cole-trapnell-lab.github.io/cufflinks/.
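The workflow above can be sketched as a sequence of command lines. The Python snippet below only builds the commands as argument lists and does not run anything. The function name, sample names, file names, output directory names, and thread count are hypothetical placeholders, and the executables are assumed to be installed under their usual names (tophat2, cufflinks, cuffmerge, cuffdiff). Check every option against the manuals linked above before use.

```python
def tuxedo_commands(samples, genome_index, genome_fasta, annotation_gtf, threads=4):
    """Build (but do not run) Tuxedo command lines.

    samples maps a condition label to a list of (name, reads_1, reads_2)
    tuples, one tuple per paired-end library. All names are placeholders.
    Returns (commands, assembly_gtfs).
    """
    commands = []
    assembly_gtfs = []
    bams_by_condition = {}
    for condition, libraries in samples.items():
        for name, reads_1, reads_2 in libraries:
            # Step 1: map cleaned reads to the genome with Tophat2.
            commands.append(["tophat2", "-p", str(threads), "-o", f"{name}_tophat",
                             genome_index, reads_1, reads_2])
            bam = f"{name}_tophat/accepted_hits.bam"
            # Step 2: assemble the alignments into transcripts with Cufflinks.
            commands.append(["cufflinks", "-p", str(threads),
                             "-o", f"{name}_cufflinks", bam])
            assembly_gtfs.append(f"{name}_cufflinks/transcripts.gtf")
            bams_by_condition.setdefault(condition, []).append(bam)
    # Step 3: merge the per-sample .gtf files against the existing annotation.
    # cuffmerge reads a text file listing one .gtf path per line; the caller
    # is expected to write assembly_gtfs to "assemblies.txt" beforehand.
    commands.append(["cuffmerge", "-g", annotation_gtf, "-s", genome_fasta,
                     "-p", str(threads), "-o", "merged_asm", "assemblies.txt"])
    # Step 4: test for differential expression between conditions.
    # Replicate BAM files within one condition are joined with commas.
    commands.append(["cuffdiff", "-p", str(threads), "-o", "diff_out",
                     "-L", ",".join(bams_by_condition), "merged_asm/merged.gtf"]
                    + [",".join(bams) for bams in bams_by_condition.values()])
    return commands, assembly_gtfs
```

Each returned list could be passed to subprocess.run(cmd, check=True) in order, after writing the returned .gtf paths to assemblies.txt ahead of the cuffmerge step.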
Bayesembler. This is a relatively new application that is easy to run and can produce good-quality transcript assemblies. It calls for Tophat2 BAM files from paired-end mapping, so it may not work with output from other mappers. It can reportedly also work with PacBio reads. It uses Bayesian modeling of transcripts to perform assembly. More info: http://bioinformatics-centre.github.io/bayesembler/, http://www.genomebiology.com/2014/15/10/501/abstract.
Oases. Oases is an extension of Velvet, an assembler designed for genome assembly. Oases adds an extra step after the Velvet assembly, and Velvet must be run with specific settings for Oases to work in transcriptome assembly: read tracking must be turned on in the Velvetg step. Oases can be used for de novo assembly. For reference-guided assembly, BAM or SAM files are given to Velveth as read files, together with a reference sequence. IMPORTANT: the BAM or SAM files must be sorted by read name, rather than by the usual coordinates. More info: http://bioweb2.pasteur.fr/docs/modules/velvet/1.1.02/Columbus_manual.pdf, https://www.ebi.ac.uk/~zerbino/oases/, http://www.embnet.org/sites/default/files/quickguides/Velvet_and_Oases-QG.pdf.
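The reference-guided Oases steps described above might look roughly like the following. This Python snippet only builds the command lines and does not run them. The function name, file names, output directory, and k-mer length are hypothetical placeholders. The velveth argument layout is modeled on the usage shown in the Columbus manual linked above and should be treated as an assumption to verify against your installed Velvet version, as should the use of samtools for the name sort.

```python
def oases_guided_commands(bam, reference_fasta, out_dir="oases_out", kmer=25):
    """Build (but do not run) commands for a reference-guided Velvet/Oases run.

    All file and directory names are placeholders; option layout is an
    assumption based on the Columbus manual and should be verified.
    """
    sorted_bam = "reads.namesorted.bam"
    return [
        # Sort alignments by read name (-n) instead of by coordinate.
        ["samtools", "sort", "-n", "-o", sorted_bam, bam],
        # Velveth takes the reference sequence plus the mapped paired reads.
        ["velveth", out_dir, str(kmer),
         "-reference", reference_fasta,
         "-shortPaired", "-bam", sorted_bam],
        # Read tracking must be switched on for Oases to work.
        ["velvetg", out_dir, "-read_trkg", "yes"],
        # Oases then runs on the Velvet output directory.
        ["oases", out_dir],
    ]
```

The k-mer length has a strong effect on Velvet and Oases results, so the default used here is only a starting point.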
