== De novo assembly of short reads with inchworm ==
* [https://github.com/trinityrnaseq/trinityrnaseq/wiki Inchworm RNA-Seq Assembler] - official project page for Trinity, of which it is one part
* It can require large amounts of memory (with large datasets), limiting the size of the set of reads that can be assembled at once.
* Based on planarian 36-nt reads, this appears to work better than velvet or tophat+cufflinks.
* Takes fasta file as input
 * To convert from fastq to fasta: /nfs/BaRC_Public/BaRC_code/Perl/fastq2fasta/fastq2fasta.pl
 * USAGE: ./fastq2fasta.pl fastqFile1 [fastqFile2 …] > fastaFile


* Assemble a fasta file of Solexa reads,
 * using regular (NOT strand-specific) RNA-Seq data [–DS]
 * using a kmer length of 25 [-K]
 * requiring a contig to be represented by at least 5 reads [–min_assembly_coverage]
 
{{{
bsub "inchworm --reads reads_sequence.fa --run_inchworm -K 25 --DS --min_assembly_coverage 5 > reads.inchworm_contigs.fa"
}}}
      

== Discovering novel genes and transcripts with tophat and cufflinks ==

Reference: [http://ccb.jhu.edu/software/tophat/index.shtml tophat] [http://cole-trapnell-lab.github.io/cufflinks/ cufflinks]

The de novo assembly worked fine in 100bp pair-end reads. For the six 40bp pair-ends reads samples in our hands, cufflinks failed at creating decent amount of junctions. For short reads (usually <45-bp), it is better to decrease segment length (–segment-length) to about half the read length and segment mismatches (–segment-mismatches) to 0 or 1.

1. Map the reads for each sample to the reference genome.  Output from TopHat (accepted_hits.bam) can be used as input for cufflinks.

  See our [[http://barcwiki.wi.mit.edu/wiki/SOPs/mapping|mapping SOP]] for more details.

2. Run Cufflinks on each mapping file: use -M to ignore all reads mapped to rRNA and mitochondrial transcripts, this will increase speed and performance.


{{{
  bsub cufflinks -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample1/accepted_hits.bam
  bsub cufflinks -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample2/accepted_hits.bam
}}}

3. Merge the resulting assemblies


{{{
Create a file called assemblies.txt, which lists the gtf files derived from cufflinks (above).  This file should include
  sample1/transcripts.gtf
  sample2/transcripts.gtf
Run cuffmerge, a GTF merging script, which creates a merged annotation (merged_asm/merged.gtf)
  cuffmerge -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa assemblies.txt
}}}
      
4. Compare the merged assembly with known or annotated genes


{{{
  cuffcompare -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa -r /nfs/genomes/mouse_gp_jul_07/gtf/mm9_refseq.gtf merged_asm/merged.gtf
}}}

== New methods == 

We haven't yet tested

* [http://www.broadinstitute.org/software/scripture/ Scripture]
* [https://github.com/trinityrnaseq/trinityrnaseq/wiki Trinity] - includes inchworm together with other methods
* [https://ccb.jhu.edu/software/stringtie/index.shtml StringTie] - alternative to CuffLinks (from the same group/lab)

See the [https://www.synapse.org/Portal.html#!Synapse:syn2817724/wiki/70952 DREAM Alternative Splicing Challenge] for bake-off description and results


== Clustering sequences with TGICL ==

TGICL was designed for the assembly of longer transcript fragments like ESTs.  
It can still be useful for the multi-step assembly of large or heterogeneous transcript fragments.  
In these cases, short read assemblers can be used as a first step to generate longer contigs (of variable lengths) which can be further assembled with TGICL.

{{{
  Sample command: 
  bsub tgicl contig_cleaned.fa -l 40 -p 90
}}}

== Cleaning the assembled sequence ==

Short reads should be generally cleaned of vector/linker/primer sequences before assembly.
In some cases we may have pre-assembled contigs that can still contain contamination.

{{{
# Sample command: In this example, contig.fa is the output file from above assembly step
bsub "seqclean contig.fa -v /nfs/genomes/UniVec/UniVec_Core -o contig_cleaned.fa"
}}}


== Other methods we've tried ==

* Velvet wasn't very successful, at least with short planarian reads.