== De novo assembly of short reads with inchworm ==
 * [https://github.com/trinityrnaseq/trinityrnaseq/wiki Inchworm RNA-Seq Assembler] - official project page for Trinity, of which inchworm is one part
 * It can require large amounts of memory with large datasets, which limits the number of reads that can be assembled at once.
 * Based on planarian 36-nt reads, this appears to work better than velvet or tophat+cufflinks.
 * Takes a fasta file as input
   * To convert from fastq to fasta: /nfs/BaRC_Public/BaRC_code/Perl/fastq2fasta/fastq2fasta.pl
   * USAGE: ./fastq2fasta.pl fastqFile1 [fastqFile2 …] > fastaFile
 * Assemble a fasta file of Solexa reads,
   * using regular (NOT strand-specific) RNA-Seq data [--DS]
   * using a kmer length of 25 [-K]
   * requiring a contig to be represented by at least 5 reads [--min_assembly_coverage]
{{{
bsub "inchworm --reads reads_sequence.fa --run_inchworm -K 25 --DS --min_assembly_coverage 5 > reads.inchworm_contigs.fa"
}}}

== Discovering novel genes and transcripts with tophat and cufflinks ==
Reference: [http://ccb.jhu.edu/software/tophat/index.shtml tophat] [http://cole-trapnell-lab.github.io/cufflinks/ cufflinks]

De novo assembly worked well with 100-bp paired-end reads. For our six samples of 40-bp paired-end reads, however, cufflinks failed to produce a decent number of junctions. For short reads (usually <45 bp), it is better to decrease the segment length (--segment-length) to about half the read length and the segment mismatches (--segment-mismatches) to 0 or 1.

 1. Map the reads for each sample to the reference genome. Output from TopHat (accepted_hits.bam) can be used as input for cufflinks. See our [[http://barcwiki.wi.mit.edu/wiki/SOPs/mapping|mapping SOP]] for more details.
 2. Run Cufflinks on each mapping file. Use -M to ignore all reads mapped to rRNA and mitochondrial transcripts; this improves speed and performance.
{{{
# -o writes each sample's transcripts.gtf into its own directory, as expected by the merge step
bsub cufflinks -o sample1 -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample1/accepted_hits.bam
bsub cufflinks -o sample2 -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample2/accepted_hits.bam
}}}
 3. Merge the resulting assemblies
{{{
# Create a file called assemblies.txt, which lists the gtf files derived from cufflinks (above).
# This file should include
#   sample1/transcripts.gtf
#   sample2/transcripts.gtf

# Run cuffmerge, a GTF merging script, which creates a merged annotation (merged_asm/merged.gtf)
cuffmerge -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa assemblies.txt
}}}
 4. Compare the merged assembly with known or annotated genes
{{{
cuffcompare -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa -r /nfs/genomes/mouse_gp_jul_07/gtf/mm9_refseq.gtf merged_asm/merged.gtf
}}}

== New methods ==
We haven't yet tested
 * [http://www.broadinstitute.org/software/scripture/ Scripture]
 * [https://github.com/trinityrnaseq/trinityrnaseq/wiki Trinity] - includes inchworm together with other methods
 * [https://ccb.jhu.edu/software/stringtie/index.shtml StringTie] - alternative to Cufflinks (from the same group/lab)

See the [https://www.synapse.org/Portal.html#!Synapse:syn2817724/wiki/70952 DREAM Alternative Splicing Challenge] for a description of the bake-off and its results.

== Clustering sequences with TGICL ==
TGICL was designed for the assembly of longer transcript fragments like ESTs. It can still be useful for the multi-step assembly of large or heterogeneous transcript fragments. In these cases, short read assemblers can be used as a first step to generate longer contigs (of variable lengths), which can then be further assembled with TGICL.
{{{
# Sample command:
bsub tgicl contig_cleaned.fa -l 40 -p 90
}}}

== Cleaning the assembled sequence ==
Short reads should generally be cleaned of vector/linker/primer sequences before assembly. In some cases we may have pre-assembled contigs that can still contain contamination.
{{{
# Sample command: in this example, contig.fa is the output file from the assembly step above
bsub "seqclean contig.fa -v /nfs/genomes/UniVec/UniVec_Core -o contig_cleaned.fa"
}}}

== Other methods we've tried ==
 * Velvet wasn't very successful, at least with short planarian reads.
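
== Example: converting fastq to fasta ==
The inchworm section relies on a local Perl script (fastq2fasta.pl) to convert fastq reads to fasta. For readers without access to that path, the conversion might be sketched in Python as below. This is only an illustration, not the BaRC script: the function name `fastq_to_fasta` is hypothetical, and the sketch assumes well-formed records of exactly four lines each (header, sequence, `+`, qualities) with no wrapped sequence lines.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Write each 4-line fastq record as a 2-line fasta record.

    Hypothetical helper (not the BaRC fastq2fasta.pl script).
    Assumes each record is exactly 4 lines: @header, sequence, +, qualities.
    Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected '@' at start of record: " + header)
        sequence = fastq_handle.readline().rstrip("\r\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        # Replace the leading '@' with '>' to form the fasta header
        fasta_handle.write(">" + header[1:] + "\n" + sequence + "\n")
        count += 1
    return count


# Small in-memory demonstration with two made-up reads
demo_in = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n")
demo_out = io.StringIO()
n_records = fastq_to_fasta(demo_in, demo_out)
print(n_records)
print(demo_out.getvalue(), end="")
```

With real data, the same function could be called with file handles, for example `open("reads.fastq")` for input and `open("reads_sequence.fa", "w")` for output; the fastq file name here is a placeholder. Reading records by position rather than by scanning for `@` avoids misreading quality lines that happen to begin with `@`.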