Changes between Initial Version and Version 1 of SOPs/AssemblingRNAseqReads

Timestamp:: 01/23/13 16:49:43 (12 years ago)
Author:: trac
Comment:: --

SOPs/AssemblingRNAseqReads

+== De novo assembly of short reads with inchworm ==
+* [http://trinityrnaseq.sourceforge.net/ Inchworm RNA-Seq Assembler] - official project page
+* It can require large amounts of memory (with large datasets), limiting the size of the set of reads that can be assembled at once.
+* Based on planarian 36-nt reads, this appears to work better than velvet or tophat+cufflinks.
+* Takes a fasta file as input
+ * To convert from fastq to fasta: /nfs/BaRC_Public/BaRC_code/Perl/fastq2fasta/fastq2fasta.pl
+ * USAGE: ./fastq2fasta.pl fastqFile1 [fastqFile2 …] > fastaFile
+* Assemble a fasta file of Solexa reads:
+ * using regular (NOT strand-specific) RNA-Seq data [--DS]
+ * using a kmer length of 25 [-K]
+ * requiring a contig to be represented by at least 5 reads [--min_assembly_coverage]
+{{{
+bsub "inchworm --reads reads_sequence.fa --run_inchworm -K 25 --DS --min_assembly_coverage 5 > reads.inchworm_contigs.fa"
+}}}
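The conversion and assembly steps above could be chained roughly as follows. This is a sketch under stated assumptions, not the BaRC fastq2fasta.pl script: the awk conversion assumes strict four-line FASTQ records, the function names `fastq_to_fasta` and `inchworm_cmd` are hypothetical helpers, and the output file name is derived from the input name purely for illustration.

```shell
# Sketch only: convert FASTQ to FASTA, then build the inchworm job command.
# Assumes strict 4-line FASTQ records (header, sequence, '+', qualities).
fastq_to_fasta() {
  awk 'FNR % 4 == 1 { print ">" substr($0, 2) }
       FNR % 4 == 2 { print }' "$@"
}

# Build the inchworm command line shown above for a given FASTA file.
# The output name is derived from the input name (illustrative convention).
inchworm_cmd() {
  local fasta="$1"
  local out="${fasta%.fa}.inchworm_contigs.fa"
  echo "inchworm --reads $fasta --run_inchworm -K 25 --DS --min_assembly_coverage 5 > $out"
}

# Example (placeholder file names):
#   fastq_to_fasta reads_sequence.fastq > reads_sequence.fa
#   bsub "$(inchworm_cmd reads_sequence.fa)"
```

Building the command as a string first makes it easy to inspect before submitting it to the queue.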
+== Discovering novel genes and transcripts with tophat and cufflinks ==
+Reference: [http://tophat.cbcb.umd.edu/ tophat] [http://cufflinks.cbcb.umd.edu/tutorial.html cufflinks]
+The de novo assembly worked well with 100-bp paired-end reads. For our six samples of 40-bp paired-end reads, however, cufflinks failed to produce a decent number of junctions. For short reads (usually <45 bp), it is better to decrease the segment length (--segment-length) to about half the read length and the segment mismatches (--segment-mismatches) to 0 or 1.
+. Map the reads for each sample to the reference genome. The accepted_hits.bam file produced by tophat can be used as input for cufflinks.
+{{{
+bsub tophat -o sample1 /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_1_1_sequence.txt-common.out s_1_2_sequence.txt-common.out
+bsub tophat -o sample2 /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_2_1_sequence.txt-common.out s_2_2_sequence.txt-common.out
+}}}
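For short reads, the mapping command above could be adapted along these lines. This is a sketch: the read length of 40 and the choice of 1 segment mismatch are placeholder values, and the command is only printed so it can be checked before submission.

```shell
# Sketch: derive tophat short-read settings from the read length.
# read_length is a placeholder value; set it to match your data.
read_length=40
segment_length=$(( read_length / 2 ))   # about half the read length
segment_mismatches=1                    # 0 or 1 for short reads

tophat_cmd="tophat --segment-length $segment_length --segment-mismatches $segment_mismatches -o sample1 /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_1_1_sequence.txt-common.out s_1_2_sequence.txt-common.out"

# Print the command for inspection; submit it with: bsub $tophat_cmd
echo "$tophat_cmd"
```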
+. Run cufflinks on each mapping file. Use -M to ignore all reads mapped to rRNA and mitochondrial transcripts; this will increase speed and performance.
+{{{
+  bsub cufflinks -o sample1 -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample1/accepted_hits.bam
+  bsub cufflinks -o sample2 -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample2/accepted_hits.bam
+}}}
+. Merge the resulting assemblies
+{{{
+# Create a file called assemblies.txt, which lists the gtf files produced by cufflinks (above).
+# This file should contain:
+#   sample1/transcripts.gtf
+#   sample2/transcripts.gtf
+# Run cuffmerge, a GTF merging script, which creates a merged annotation (merged_asm/merged.gtf)
+  cuffmerge -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa assemblies.txt
+}}}
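With many samples, assemblies.txt could be generated rather than typed by hand. A minimal sketch, assuming every sample directory is named sample* and contains a transcripts.gtf written by cufflinks; `write_assemblies_list` is a hypothetical helper name.

```shell
# Sketch: write assemblies.txt from the per-sample cufflinks output directories.
# Assumes each sample directory (sample1, sample2, ...) holds a transcripts.gtf.
write_assemblies_list() {
  for gtf in sample*/transcripts.gtf; do
    # Skip the unexpanded pattern when no sample directory matches
    if [ -e "$gtf" ]; then
      printf '%s\n' "$gtf"
    fi
  done > assemblies.txt
}

# Run from the directory that contains the sample directories:
#   write_assemblies_list
```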
+. Compare the merged assembly with known or annotated genes
+{{{
+  cuffcompare -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa -r /nfs/genomes/mouse_gp_jul_07/gtf/mm9_refseq.gtf merged_asm/merged.gtf
+}}}
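To pull candidate novel transcripts out of the comparison, the cuffcompare class codes can be filtered. The sketch below assumes the default output prefix (cuffcmp), so the annotated file is cuffcmp.combined.gtf, and that each record carries class_code and transcript_id attributes. In the cuffcompare class code scheme, "u" marks unknown intergenic transcripts and "j" marks potentially novel isoforms. `transcripts_with_class_code` is a hypothetical helper name.

```shell
# Sketch: list transcript IDs that cuffcompare flagged with a given class code.
# Assumes GTF records with class_code and transcript_id attributes,
# as in cuffcmp.combined.gtf (default cuffcompare output prefix).
transcripts_with_class_code() {
  local code="$1" gtf="$2"
  grep "class_code \"$code\"" "$gtf" \
    | sed 's/.*transcript_id "\([^"]*\)".*/\1/' \
    | sort -u
}

# Example: candidate novel intergenic transcripts
#   transcripts_with_class_code u cuffcmp.combined.gtf
```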
+== New methods ==
+We haven't yet tested:
+* [http://www.broadinstitute.org/software/scripture/ Scripture]
+* [http://www.broadinstitute.org/scientific-community/software/trinity Trinity] - includes inchworm together with other methods
+See the [http://www.the-dream-project.org/result/alternative-splicing DREAM “alternative splicing”] challenge for bake-off description and results
+== Cluster sequences with TGICL ==
+{{{
+  TGICL was designed for the assembly of longer transcript fragments like ESTs.
+  It can still be useful for the multi-step assembly of large or heterogeneous transcript fragments.
+  In these cases, short read assemblers can be used as a first step to generate longer contigs (of variable lengths) which can be further assembled with TGICL.
+}}}
+{{{
+  Sample command (-l: minimum overlap length; -p: minimum percent identity for overlaps):
+  bsub tgicl contig_cleaned.fa -l 40 -p 90
+}}}
+== Cleaning the assembled sequence ==
+{{{
+  Short reads should generally be cleaned of vector/linker/primer sequences before assembly.
+  In some cases we may be given pre-assembled contigs that can still contain contamination.
+  Sample command: in this example, contig.fa is the output file from the assembly step above
+  bsub "seqclean contig.fa -v /nfs/genomes/UniVec/UniVec_Core -o contig_cleaned.fa"
+}}}
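Because the TGICL sample command reads contig_cleaned.fa, cleaning has to finish before clustering starts. One way to enforce that order on LSF is a job dependency. This is a sketch: the job name clean_contigs is a placeholder, and the commands are only printed for inspection rather than submitted.

```shell
# Sketch: clean the contigs first, then cluster the cleaned output with TGICL.
# The job name "clean_contigs" is a placeholder; -J (job name) and
# -w (dependency expression) are LSF bsub options.
clean_cmd='bsub -J clean_contigs "seqclean contig.fa -v /nfs/genomes/UniVec/UniVec_Core -o contig_cleaned.fa"'
cluster_cmd='bsub -w "done(clean_contigs)" tgicl contig_cleaned.fa -l 40 -p 90'

# Print the two submissions in the order they should run
printf '%s\n' "$clean_cmd" "$cluster_cmd"
```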
+== Other methods we've tried ==
+* Velvet wasn't very successful, at least with short planarian reads.