De novo assembly of short reads with inchworm

  • Inchworm RNA-Seq Assembler - official project page for Trinity, of which it is one part
  • It can require large amounts of memory (with large datasets), limiting the size of the set of reads that can be assembled at once.
  • Based on planarian 36-nt reads, this appears to work better than velvet or tophat+cufflinks.
  • Takes fasta file as input
    • To convert from fastq to fasta: /nfs/BaRC_Public/BaRC_code/Perl/fastq2fasta/
    • USAGE: ./ fastqFile1 [fastqFile2 …] > fastaFile
  • Assemble a fasta file of Solexa reads,
    • using regular (NOT strand-specific) RNA-Seq data [–DS]
    • using a kmer length of 25 [-K]
    • requiring a contig to be represented by at least 5 reads [–min_assembly_coverage]

bsub "inchworm --reads reads_sequence.fa --run_inchworm -K 25 --DS --min_assembly_coverage 5 > reads.inchworm_contigs.fa"

Reference: tophat cufflinks

The de novo assembly worked fine in 100bp pair-end reads. For the six 40bp pair-ends reads samples in our hands, cufflinks failed at creating decent amount of junctions. For short reads (usually <45-bp), it is better to decrease segment length (–segment-length) to about half the read length and segment mismatches (–segment-mismatches) to 0 or 1.

  1. Map the reads for each sample to the reference genome. Output from TopHat (accepted_hits.bam) can be used as input for cufflinks.

See our mapping SOP for more details.

  1. Run Cufflinks on each mapping file: use -M to ignore all reads mapped to rRNA and mitochondrial transcripts, this will increase speed and performance.
  bsub cufflinks -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample1/accepted_hits.bam
  bsub cufflinks -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample2/accepted_hits.bam
  1. Merge the resulting assemblies
Create a file called assemblies.txt, which lists the gtf files derived from cufflinks (above).  This file should include
Run cuffmerge, a GTF merging script, which creates a merged annotation (merged_asm/merged.gtf)
  cuffmerge -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa assemblies.txt

  1. Compare the merged assembly with known or annotated genes
  cuffcompare -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa -r /nfs/genomes/mouse_gp_jul_07/gtf/mm9_refseq.gtf merged_asm/merged.gtf

New methods

We haven't yet tested

See the DREAM Alternative Splicing Challenge for bake-off description and results

Clustering sequences with TGICL

TGICL was designed for the assembly of longer transcript fragments like ESTs. It can still be useful for the multi-step assembly of large or heterogeneous transcript fragments. In these cases, short read assemblers can be used as a first step to generate longer contigs (of variable lengths) which can be further assembled with TGICL.

  Sample command: 
  bsub tgicl contig_cleaned.fa -l 40 -p 90

Cleaning the assembled sequence

Short reads should be generally cleaned of vector/linker/primer sequences before assembly. In some cases we may have pre-assembled contigs that can still contain contamination.

# Sample command: In this example, contig.fa is the output file from above assembly step
bsub "seqclean contig.fa -v /nfs/genomes/UniVec/UniVec_Core -o contig_cleaned.fa"

Other methods we've tried

  • Velvet wasn't very successful, at least with short planarian reads.
Note: See TracWiki for help on using the wiki.