Changes between Initial Version and Version 1 of SOPs/AssemblingRNAseqReads


Timestamp:
01/23/13 16:49:43
Author:
trac

== De novo assembly of short reads with Inchworm ==
* [http://trinityrnaseq.sourceforge.net/ Inchworm RNA-Seq Assembler] - official project page
* It can require large amounts of memory on large datasets, which limits how many reads can be assembled at once.
* On planarian 36-nt reads, it appears to work better than Velvet or TopHat + Cufflinks.
* Takes a FASTA file as input.
 * To convert from FASTQ to FASTA: /nfs/BaRC_Public/BaRC_code/Perl/fastq2fasta/fastq2fasta.pl
 * USAGE: ./fastq2fasta.pl fastqFile1 [fastqFile2 …] > fastaFile
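
As a rough illustration of what the conversion does, the following shell sketch turns FASTQ into FASTA with awk. It is not the BaRC fastq2fasta.pl script: the function name fastq_to_fasta is hypothetical, and it assumes every FASTQ record spans exactly four lines (no wrapped sequences).

```shell
# Hypothetical helper (not fastq2fasta.pl): read FASTQ on stdin, write FASTA on stdout.
# Assumption: each record is exactly four lines (header, sequence, "+", qualities).
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header line: swap leading "@" for ">"
         NR % 4 == 2 { print }                     # sequence line: copy unchanged
        '
}

# Usage:
#   fastq_to_fasta < reads_sequence.fastq > reads_sequence.fa
```

The "+" separator and quality lines are simply dropped, since FASTA has no place for them.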

* Assemble a FASTA file of Solexa reads,
 * using regular (NOT strand-specific) RNA-Seq data [--DS]
 * using a k-mer length of 25 [-K]
 * requiring a contig to be represented by at least 5 reads [--min_assembly_coverage]

{{{
bsub "inchworm --reads reads_sequence.fa --run_inchworm -K 25 --DS --min_assembly_coverage 5 > reads.inchworm_contigs.fa"
}}}

== Discovering novel genes and transcripts with TopHat and Cufflinks ==

References: [http://tophat.cbcb.umd.edu/ TopHat], [http://cufflinks.cbcb.umd.edu/tutorial.html Cufflinks]

The de novo assembly worked well with 100-bp paired-end reads. For our six samples of 40-bp paired-end reads, however, Cufflinks failed to produce a reasonable number of junctions. For short reads (usually < 45 bp), it is better to decrease TopHat's segment length (--segment-length) to about half the read length and the number of segment mismatches (--segment-mismatches) to 0 or 1.
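
The rule of thumb above can be sketched as a small shell helper that prints the TopHat options for a given read length. The function name tophat_segment_opts is hypothetical, and choosing 1 rather than 0 segment mismatches is an assumption; --segment-length and --segment-mismatches are the TopHat options named above.

```shell
# Hypothetical helper: print TopHat segment options for a given read length.
# For reads shorter than 45 nt: segment length = half the read length (rounded down),
# and at most 1 segment mismatch (assumption; 0 is the stricter alternative).
tophat_segment_opts() {
    read_len=$1
    if [ "$read_len" -lt 45 ]; then
        echo "--segment-length $((read_len / 2)) --segment-mismatches 1"
    fi
    # For longer reads nothing is printed, so TopHat keeps its defaults.
}

# Usage with 40-nt paired-end reads:
#   bsub tophat $(tophat_segment_opts 40) -o sample1 /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_1_1_sequence.txt-common.out s_1_2_sequence.txt-common.out
```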

1. Map the reads for each sample to the reference genome with TopHat. The resulting accepted_hits.bam can be used as input for Cufflinks.

{{{
bsub tophat -o sample1 /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_1_1_sequence.txt-common.out s_1_2_sequence.txt-common.out
bsub tophat -o sample2 /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_2_1_sequence.txt-common.out s_2_2_sequence.txt-common.out
}}}

2. Run Cufflinks on each mapping file. Use -M to ignore all reads mapped to rRNA and mitochondrial transcripts; masking them increases speed and improves performance. Use -o so that each sample's transcripts.gtf is written to its own directory.

{{{
  bsub cufflinks -o sample1 -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample1/accepted_hits.bam
  bsub cufflinks -o sample2 -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample2/accepted_hits.bam
}}}

3. Merge the resulting assemblies

{{{
# Create a file called assemblies.txt that lists the GTF files produced by Cufflinks (above), one per line:
#   sample1/transcripts.gtf
#   sample2/transcripts.gtf
# Then run cuffmerge, a GTF merging script, which creates a merged annotation (merged_asm/merged.gtf):
  cuffmerge -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa assemblies.txt
}}}
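
With more than a couple of samples, assemblies.txt can be generated rather than typed by hand. The following is a minimal sketch; the function name write_assembly_list is hypothetical, and it assumes each sample directory holds the transcripts.gtf written by Cufflinks.

```shell
# Hypothetical helper: write the path of every existing <sample>/transcripts.gtf
# to the list file given as first argument, one path per line.
write_assembly_list() {
    out=$1
    shift
    : > "$out"                                  # start with an empty list file
    for sample in "$@"; do
        gtf="$sample/transcripts.gtf"
        if [ -f "$gtf" ]; then
            echo "$gtf" >> "$out"
        else
            echo "warning: $gtf not found, skipping" >&2
        fi
    done
}

# Usage:
#   write_assembly_list assemblies.txt sample1 sample2
```

Samples whose Cufflinks run has not finished (or failed) are skipped with a warning instead of producing a list entry that cuffmerge cannot read.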

4. Compare the merged assembly with known or annotated genes

{{{
  cuffcompare -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa -r /nfs/genomes/mouse_gp_jul_07/gtf/mm9_refseq.gtf merged_asm/merged.gtf
}}}

== New methods ==

We haven't yet tested:

* [http://www.broadinstitute.org/software/scripture/ Scripture]
* [http://www.broadinstitute.org/scientific-community/software/trinity Trinity] - includes Inchworm together with other methods

See the [http://www.the-dream-project.org/result/alternative-splicing DREAM “alternative splicing”] challenge for a bake-off description and results.

== Cluster sequences with TGICL ==

TGICL was designed for the assembly of longer transcript fragments such as ESTs. It can still be useful for the multi-step assembly of large or heterogeneous transcript fragments. In these cases, a short-read assembler can be used as a first step to generate longer contigs (of variable lengths), which can then be further assembled with TGICL.

Sample command (-l sets the minimum overlap length and -p the minimum percent identity; contig_cleaned.fa is the output of the cleaning step described below):

{{{
  bsub tgicl contig_cleaned.fa -l 40 -p 90
}}}

== Cleaning the assembled sequence ==

Short reads should generally be cleaned of vector/linker/primer sequences before assembly. In some cases we may have pre-assembled contigs that can still contain contamination.

Sample command (in this example, contig.fa is the output file from the assembly step above):

{{{
  bsub "seqclean contig.fa -v /nfs/genomes/UniVec/UniVec_Core -o contig_cleaned.fa"
}}}

== Other methods we've tried ==

* Velvet wasn't very successful, at least with short planarian reads.