De novo assembly of short reads with inchworm
- Inchworm RNA-Seq Assembler - official project page for Trinity, of which it is one part
- It can require large amounts of memory (with large datasets), limiting the size of the set of reads that can be assembled at once.
- Based on planarian 36-nt reads, this appears to work better than velvet or tophat+cufflinks.
- Takes fasta file as input
- To convert from fastq to fasta: /nfs/BaRC_Public/BaRC_code/Perl/fastq2fasta/fastq2fasta.pl
- USAGE: ./fastq2fasta.pl fastqFile1 [fastqFile2 …] > fastaFile
- Assemble a fasta file of Solexa reads,
- using regular (NOT strand-specific) RNA-Seq data [–DS]
- using a kmer length of 25 [-K]
- requiring a contig to be represented by at least 5 reads [–min_assembly_coverage]
bsub "inchworm --reads reads_sequence.fa --run_inchworm -K 25 --DS --min_assembly_coverage 5 > reads.inchworm_contigs.fa"
Discovering novel genes and transcripts with tophat and cufflinks
The de novo assembly worked fine in 100bp pair-end reads. For the six 40bp pair-ends reads samples in our hands, cufflinks failed at creating decent amount of junctions. For short reads (usually <45-bp), it is better to decrease segment length (–segment-length) to about half the read length and segment mismatches (–segment-mismatches) to 0 or 1.
- Map the reads for each sample to the reference genome. Output from TopHat (accepted_hits.bam) can be used as input for cufflinks.
See our mapping SOP for more details.
- Run Cufflinks on each mapping file: use -M to ignore all reads mapped to rRNA and mitochondrial transcripts, this will increase speed and performance.
bsub cufflinks -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample1/accepted_hits.bam bsub cufflinks -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample2/accepted_hits.bam
- Merge the resulting assemblies
Create a file called assemblies.txt, which lists the gtf files derived from cufflinks (above). This file should include sample1/transcripts.gtf sample2/transcripts.gtf Run cuffmerge, a GTF merging script, which creates a merged annotation (merged_asm/merged.gtf) cuffmerge -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa assemblies.txt
- Compare the merged assembly with known or annotated genes
cuffcompare -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa -r /nfs/genomes/mouse_gp_jul_07/gtf/mm9_refseq.gtf merged_asm/merged.gtf
We haven't yet tested
- Trinity - includes inchworm together with other methods
- StringTie - alternative to CuffLinks (from the same group/lab)
See the DREAM Alternative Splicing Challenge for bake-off description and results
Clustering sequences with TGICL
TGICL was designed for the assembly of longer transcript fragments like ESTs. It can still be useful for the multi-step assembly of large or heterogeneous transcript fragments. In these cases, short read assemblers can be used as a first step to generate longer contigs (of variable lengths) which can be further assembled with TGICL.
Sample command: bsub tgicl contig_cleaned.fa -l 40 -p 90
Cleaning the assembled sequence
Short reads should be generally cleaned of vector/linker/primer sequences before assembly. In some cases we may have pre-assembled contigs that can still contain contamination.
# Sample command: In this example, contig.fa is the output file from above assembly step bsub "seqclean contig.fa -v /nfs/genomes/UniVec/UniVec_Core -o contig_cleaned.fa"
Other methods we've tried
- Velvet wasn't very successful, at least with short planarian reads.