SOPs/AssemblingRNAseqReads – BaRC Wiki

De novo assembly of short reads with inchworm

Inchworm RNA-Seq Assembler: see the official project page for Trinity, of which Inchworm is one component.
It can require large amounts of memory (with large datasets), limiting the size of the set of reads that can be assembled at once.
Based on planarian 36-nt reads, this appears to work better than velvet or tophat+cufflinks.
Takes fasta file as input
- To convert from fastq to fasta: /nfs/BaRC_Public/BaRC_code/Perl/fastq2fasta/fastq2fasta.pl
- USAGE: ./fastq2fasta.pl fastqFile1 [fastqFile2 …] > fastaFile
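
For readers without access to the BaRC script, the conversion can be sketched with awk. This is not the BaRC fastq2fasta.pl implementation; the function name fastq_to_fasta is hypothetical, and the sketch assumes plain four-line FASTQ records (no wrapped sequence lines) in every input file.

```shell
# Convert four-line FASTQ records to FASTA.
# Assumption: each record is exactly 4 lines (header, sequence, "+", qualities).
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # "@name" becomes ">name"
         NR % 4 == 2 { print }                      # sequence line, unchanged
        ' "$@"
}

# Usage:
#   fastq_to_fasta fastqFile1 fastqFile2 > fastaFile
```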

Assemble a fasta file of Solexa reads,
- using regular (NOT strand-specific) RNA-Seq data [--DS]
- using a kmer length of 25 [-K]
- requiring a contig to be represented by at least 5 reads [--min_assembly_coverage]

sbatch --job-name=inchworm --mem=32G --wrap="inchworm --reads reads_sequence.fa --run_inchworm -K 25 --DS --min_assembly_coverage 5 > reads.inchworm_contigs.fa"
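
Once the job finishes, a quick summary of the contig file can help judge whether the assembly looks reasonable. The following is a minimal sketch, not part of Inchworm; the function name fasta_summary is hypothetical.

```shell
# Print the number of sequences, total bases, and longest sequence in a FASTA file.
fasta_summary() {
    awk '/^>/ { if (len > max) max = len; n++; len = 0; next }
         { len += length($0); total += length($0) }
         END { if (len > max) max = len
               printf "sequences=%d total_bases=%d longest=%d\n", n, total, max }' "$1"
}

# Usage:
#   fasta_summary reads.inchworm_contigs.fa
```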

Discovering novel genes and transcripts with tophat and cufflinks

Reference: tophat cufflinks

The de novo assembly worked fine with 100-bp paired-end reads. With our six samples of 40-bp paired-end reads, however, cufflinks failed to produce a decent number of junctions. For short reads (usually <45 bp), it is better to decrease the tophat segment length (--segment-length) to about half the read length and the number of segment mismatches (--segment-mismatches) to 0 or 1.
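
The short-read rule of thumb above can be sketched as a small helper that prints the tophat options for a given read length. The function name tophat_short_read_opts is hypothetical, the 45-nt cutoff comes from the text, and choosing 1 mismatch rather than 0 is an arbitrary pick within the suggested range.

```shell
# Print tophat options for short reads; print nothing for reads of 45 nt or more,
# so that tophat's defaults apply.
tophat_short_read_opts() {
    read_len=$1
    if [ "$read_len" -lt 45 ]; then
        # Segment length is about half the read length (integer division).
        printf '%s\n' "--segment-length $((read_len / 2)) --segment-mismatches 1"
    fi
}

# Usage (index and read file names are placeholders):
#   tophat $(tophat_short_read_opts 40) -o sample1 <bowtie_index> reads_1.fq reads_2.fq
```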

Map the reads for each sample to the reference genome. Output from TopHat (accepted_hits.bam) can be used as input for cufflinks.

See our mapping SOP for more details.

Run Cufflinks on each mapping file. Use -M to ignore all reads mapped to rRNA and mitochondrial transcripts; this reduces run time and tends to make the abundance estimates more robust.

  sbatch --job-name=cufflinks_s1 --mem=32G --wrap="cufflinks -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf -o sample1 sample1/accepted_hits.bam"
  sbatch --job-name=cufflinks_s2 --mem=32G --wrap="cufflinks -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf -o sample2 sample2/accepted_hits.bam"
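
With more than a couple of samples, the per-sample commands can be generated in a loop. The sketch below only prints the sbatch commands so they can be reviewed before submission. It assumes each sample directory holds an accepted_hits.bam and that Cufflinks output should be written to the same directory (-o), so that each sample gets its own transcripts.gtf. The function name cufflinks_cmd and the job-name pattern are hypothetical.

```shell
# Mask GTF of rRNA and mitochondrial transcripts (file name from the commands above).
MASK_GTF=Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf

# Print (not run) the sbatch command for one sample directory.
cufflinks_cmd() {
    printf 'sbatch --job-name=cufflinks_%s --mem=32G --wrap="cufflinks -M %s -o %s %s/accepted_hits.bam"\n' \
        "$1" "$MASK_GTF" "$1" "$1"
}

# Review the printed commands first; change "done" to "done | sh" to submit them.
for sample in sample1 sample2; do
    cufflinks_cmd "$sample"
done
```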

Merge the resulting assemblies

Create a file called assemblies.txt that lists the GTF files produced by cufflinks (above), one per line. This file should contain
  sample1/transcripts.gtf
  sample2/transcripts.gtf
Run cuffmerge, a GTF merging script, which creates a merged annotation (merged_asm/merged.gtf)
  cuffmerge -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa assemblies.txt
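
Rather than typing assemblies.txt by hand, the list can be built from the sample directories. This is a minimal sketch, assuming each sample directory contains a transcripts.gtf; the function name list_assemblies is hypothetical. Empty or missing files are skipped so that a failed Cufflinks run does not end up in the merge.

```shell
# Print each non-empty GTF path given as an argument, one per line.
list_assemblies() {
    for gtf in "$@"; do
        if [ -s "$gtf" ]; then
            printf '%s\n' "$gtf"
        fi
    done
}

# Usage (from the directory holding the sample folders):
#   list_assemblies sample*/transcripts.gtf > assemblies.txt
```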

Compare the merged assembly with known or annotated genes

  cuffcompare -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa -r /nfs/genomes/mouse_gp_jul_07/gtf/mm9_refseq.gtf merged_asm/merged.gtf
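
To get a first overview of candidate novel transcripts, the cuffcompare class codes can be tallied (for example "=" for a complete match to a reference transcript, "j" for a potentially novel isoform, "u" for an unknown intergenic transcript). The sketch below assumes the default output prefix (cuffcmp) and that the class code is the fourth tab-separated column of the .tracking file; check both against your cuffcompare version. The function name count_class_codes is hypothetical.

```shell
# Count transfrags per cuffcompare class code in a .tracking file.
# Assumption: tab-separated columns with the class code in column 4.
count_class_codes() {
    awk -F'\t' '{ n[$4]++ } END { for (c in n) printf "%s\t%d\n", c, n[c] }' "$1" |
        LC_ALL=C sort
}

# Usage, assuming the default output prefix "cuffcmp":
#   count_class_codes cuffcmp.tracking
```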

New methods

Methods we haven't yet tested:

Scripture
Trinity - includes inchworm together with other methods
StringTie - alternative to Cufflinks (from the same group/lab)

See the DREAM Alternative Splicing Challenge for bake-off description and results

Clustering sequences with TGICL

TGICL was designed for the assembly of longer transcript fragments like ESTs. It can still be useful for the multi-step assembly of large or heterogeneous transcript fragments. In these cases, short read assemblers can be used as a first step to generate longer contigs (of variable lengths) which can be further assembled with TGICL.

  Sample command: 
  sbatch --job-name=tgicl --mem=16G --wrap="tgicl contig_cleaned.fa -l 40 -p 90"

Cleaning the assembled sequence

Short reads should generally be cleaned of vector/linker/primer sequences before assembly. In some cases we may have pre-assembled contigs that still contain contamination.

# Sample command: in this example, contig.fa is the output file from the assembly step above
sbatch --job-name=seqclean --mem=8G --wrap="seqclean contig.fa -v /nfs/genomes/UniVec/UniVec_Core -o contig_cleaned.fa"

Other methods we've tried

Velvet wasn't very successful, at least with short planarian reads.
