== De novo assembly of short reads with inchworm ==
* [http://trinityrnaseq.sourceforge.net/ Inchworm RNA-Seq Assembler] - official project page
* It can require large amounts of memory with large datasets, which limits how many reads can be assembled at once.
* Based on planarian 36-nt reads, this appears to work better than velvet or tophat+cufflinks.
* Takes a fasta file as input
* To convert from fastq to fasta: /nfs/BaRC_Public/BaRC_code/Perl/fastq2fasta/fastq2fasta.pl
* USAGE: ./fastq2fasta.pl fastqFile1 [fastqFile2 …] > fastaFile
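Where the BaRC script is not available, the conversion can be approximated with awk. This is only a minimal sketch, not the fastq2fasta.pl implementation: it assumes every fastq record is exactly four lines (no wrapped sequence lines), and fastq_to_fasta is an illustrative name.

```shell
# Minimal fastq-to-fasta conversion.
# Assumption: each record is exactly 4 lines (header, sequence, "+", qualities).
# fastq_to_fasta is a hypothetical helper name, not part of any tool.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header: replace leading "@" with ">"
         NR % 4 == 2 { print }                      # sequence line, copied unchanged
        ' "$@"
}
```

Usage follows the same pattern as the Perl script, for example: fastq_to_fasta fastqFile1 > fastaFile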


* Assemble a fasta file of Solexa reads,
* using regular (NOT strand-specific) RNA-Seq data [--DS]
* using a kmer length of 25 [-K]
* requiring a contig to be represented by at least 5 reads [--min_assembly_coverage]

{{{
bsub "inchworm --reads reads_sequence.fa --run_inchworm -K 25 --DS --min_assembly_coverage 5 > reads.inchworm_contigs.fa"
}}}
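Once the job finishes, a quick sanity check on the contig file can help when comparing parameter settings. The helper below is a sketch that only counts fasta header lines and sequence characters; fasta_stats is an illustrative name and is not part of inchworm.

```shell
# Print the number of contigs and the total number of bases in a fasta file.
# fasta_stats is a hypothetical helper name.
fasta_stats() {
    awk '/^>/ { n++; next }              # header line: count one contig
         { bases += length($0) }          # sequence line: add its length
         END { printf "contigs=%d bases=%d\n", n, bases }' "$@"
}
```

For example: fasta_stats reads.inchworm_contigs.fa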


== Discovering novel genes and transcripts with tophat and cufflinks ==

Reference: [http://tophat.cbcb.umd.edu/ tophat] [http://cufflinks.cbcb.umd.edu/tutorial.html cufflinks]

The de novo assembly worked well with 100-bp paired-end reads. For our six samples of 40-bp paired-end reads, however, cufflinks failed to produce a reasonable number of junctions. For short reads (usually <45 bp), it is better to decrease the segment length (--segment-length) to about half the read length and the segment mismatches (--segment-mismatches) to 0 or 1.
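The short-read advice above can be expressed as a small helper that prints the extra tophat options for a given read length. This is a sketch only: the 45-nt cutoff and the choice of 1 mismatch come from the note above, and short_read_opts is an illustrative name rather than anything provided by tophat.

```shell
# Print suggested tophat segment options for a read length given in nt.
# Reads of 45 nt or longer get no extra options (tophat defaults apply).
# short_read_opts is a hypothetical helper name.
short_read_opts() {
    read_len=$1
    if [ "$read_len" -lt 45 ]; then
        # Integer division gives about half the read length.
        echo "--segment-length $((read_len / 2)) --segment-mismatches 1"
    fi
}
```

For 40-bp reads, short_read_opts 40 prints "--segment-length 20 --segment-mismatches 1", which can be placed on the tophat command line before the -o option.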

1. Map the reads for each sample to the reference genome with tophat. The resulting accepted_hits.bam can be used as input for cufflinks.


{{{
bsub tophat -o sample1 /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_1_1_sequence.txt-common.out s_1_2_sequence.txt-common.out
bsub tophat -o sample2 /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_2_1_sequence.txt-common.out s_2_2_sequence.txt-common.out
}}}

2. Run cufflinks on each mapping file. Use -M to ignore all reads mapped to rRNA and mitochondrial transcripts, which improves speed and performance.


{{{
# -o writes each sample's transcripts.gtf into its own directory (the default is the current directory)
bsub cufflinks -o sample1 -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample1/accepted_hits.bam
bsub cufflinks -o sample2 -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample2/accepted_hits.bam
}}}

3. Merge the resulting assemblies


{{{
Create a file called assemblies.txt that lists the GTF files produced by cufflinks (above). This file should contain:
sample1/transcripts.gtf
sample2/transcripts.gtf
Run cuffmerge, a GTF merging script, which creates a merged annotation (merged_asm/merged.gtf):
cuffmerge -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa assemblies.txt
}}}
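With more than a few samples, assemblies.txt can be generated rather than typed by hand. The sketch below assumes each sample directory holds a transcripts.gtf file, as in the listing above; make_assembly_list is an illustrative name.

```shell
# Print one transcripts.gtf path per line for each sample directory given.
# Directories without a transcripts.gtf are reported on stderr and skipped.
# make_assembly_list is a hypothetical helper name.
make_assembly_list() {
    for dir in "$@"; do
        if [ -f "$dir/transcripts.gtf" ]; then
            echo "$dir/transcripts.gtf"
        else
            echo "warning: no transcripts.gtf in $dir" >&2
        fi
    done
}
```

For example: make_assembly_list sample1 sample2 > assemblies.txt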

4. Compare the merged assembly with known or annotated genes


{{{
cuffcompare -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa -r /nfs/genomes/mouse_gp_jul_07/gtf/mm9_refseq.gtf merged_asm/merged.gtf
}}}
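cuffcompare assigns each transcript a class code, and code "u" marks transcripts in unannotated intergenic regions, which are the usual candidates for novel genes. The filter below is a sketch under two assumptions that should be checked against the actual output: that cuffcompare was run with its default output prefix, so the annotated transcripts are in cuffcmp.combined.gtf, and that the attribute is written as class_code "u". novel_intergenic is an illustrative name.

```shell
# Keep only GTF lines whose attribute column carries class_code "u".
# novel_intergenic is a hypothetical helper name.
novel_intergenic() {
    grep 'class_code "u"' "$@"
}
```

For example: novel_intergenic cuffcmp.combined.gtf > novel_intergenic.gtf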

== New methods ==

We haven't yet tested:

* [http://www.broadinstitute.org/software/scripture/ Scripture]
* [http://www.broadinstitute.org/scientific-community/software/trinity Trinity] - includes inchworm together with other methods
See the [http://www.the-dream-project.org/result/alternative-splicing DREAM “alternative splicing”] challenge for a description of the bake-off and its results.


== Cluster sequences with TGICL ==


{{{
TGICL was designed for the assembly of longer transcript fragments like ESTs.
It can still be useful for the multi-step assembly of large or heterogeneous transcript fragments.
In these cases, short read assemblers can be used as a first step to generate longer contigs (of variable lengths) which can be further assembled with TGICL.
}}}

{{{
Sample command:
bsub tgicl contig_cleaned.fa -l 40 -p 90
}}}

== Cleaning the assembled sequence ==


{{{
Short reads should generally be cleaned of vector/linker/primer sequences before assembly.
In some cases we may have pre-assembled contigs that can still contain contamination.

Sample command (in this example, contig.fa is the output file from the assembly step above):
bsub "seqclean contig.fa -v /nfs/genomes/UniVec/UniVec_Core -o contig_cleaned.fa"
}}}


== Other methods we've tried ==

* Velvet wasn't very successful, at least with short planarian reads.