| | 1 | == De novo assembly of short reads with inchworm == |
| | 2 | * [http://trinityrnaseq.sourceforge.net/ Inchworm RNA-Seq Assembler] - official project page |
| | 3 | * It can require large amounts of memory (with large datasets), limiting the size of the set of reads that can be assembled at once. |
| | 4 | * Based on planarian 36-nt reads, this appears to work better than velvet or tophat+cufflinks. |
| | 5 | * Takes fasta file as input |
| | 6 | * To convert from fastq to fasta: /nfs/BaRC_Public/BaRC_code/Perl/fastq2fasta/fastq2fasta.pl |
| | 7 | * USAGE: ./fastq2fasta.pl fastqFile1 [fastqFile2 …] > fastaFile |
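
If the Perl script above is not available, the same conversion can be sketched in plain shell. This is only an illustration, not the BaRC script: the function name `fastq_to_fasta` is hypothetical, and it assumes every FASTQ record occupies exactly four lines (header, sequence, `+`, qualities).

```shell
# Hypothetical helper (not fastq2fasta.pl): convert FASTQ on stdin to FASTA on stdout.
# Assumption: each record is exactly 4 lines, with no wrapped sequence lines.
fastq_to_fasta() {
    awk '
        NR % 4 == 1 { print ">" substr($0, 2) }  # header line: replace leading "@" with ">"
        NR % 4 == 2 { print }                     # sequence line: copy unchanged
    '
}

# Example use (file names are placeholders):
#   fastq_to_fasta < reads_sequence.fastq > reads_sequence.fa
```

Quality lines are simply dropped, so the output carries no base-quality information.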
| | 8 | |
| | 9 | |
| | 10 | * Assemble a fasta file of Solexa reads, |
| | 11 | * using regular (NOT strand-specific) RNA-Seq data [--DS] |
| | 12 | * using a kmer length of 25 [-K] |
| | 13 | * requiring a contig to be represented by at least 5 reads [--min_assembly_coverage] |
| | 14 | |
| | 15 | {{{ |
| | 16 | bsub "inchworm --reads reads_sequence.fa --run_inchworm -K 25 --DS --min_assembly_coverage 5 > reads.inchworm_contigs.fa" |
| | 17 | }}} |
| | 18 | |
| | 19 | |
| | 20 | == Discovering novel genes and transcripts with tophat and cufflinks == |
| | 21 | |
| | 22 | Reference: [http://tophat.cbcb.umd.edu/ tophat] [http://cufflinks.cbcb.umd.edu/tutorial.html cufflinks] |
| | 23 | |
| | 24 | The de novo assembly worked well with 100-bp paired-end reads. For our six samples of 40-bp paired-end reads, cufflinks failed to produce a decent number of junctions. For short reads (usually <45 bp), it is better to decrease the segment length (--segment-length) to about half the read length and the segment mismatches (--segment-mismatches) to 0 or 1. |
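
The short-read rule of thumb above can be written as a small shell helper. This is a sketch under stated assumptions: the function name `short_read_tophat_opts` is hypothetical, the 45-bp cutoff is taken from the text, and the choice of 1 mismatch (rather than 0) is arbitrary.

```shell
# Hypothetical helper: print tophat segment options for a given read length.
# Assumptions: reads shorter than 45 bp count as "short"; 1 segment mismatch is chosen
# here, but 0 is equally consistent with the advice above.
short_read_tophat_opts() {
    read_len=$1
    if [ "$read_len" -lt 45 ]; then
        # About half the read length (integer division)
        echo "--segment-length $((read_len / 2)) --segment-mismatches 1"
    fi
    # For longer reads nothing is printed, so tophat keeps its defaults.
}

# Example use with 40-bp reads (paths as in the mapping step):
#   bsub tophat $(short_read_tophat_opts 40) -o sample1 \
#       /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 \
#       s_1_1_sequence.txt-common.out s_1_2_sequence.txt-common.out
```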
| | 25 | |
| | 26 | 1. Map the reads for each sample to the reference genome: accepted_hits.bam from tophat can be used as input for cufflinks |
| | 27 | |
| | 28 | |
| | 29 | {{{ |
| | 30 | bsub tophat -o sample1 /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_1_1_sequence.txt-common.out s_1_2_sequence.txt-common.out |
| | 31 | bsub tophat -o sample2 /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_2_1_sequence.txt-common.out s_2_2_sequence.txt-common.out |
| | 32 | }}} |
| | 33 | |
| | 34 | 2. Run Cufflinks on each mapping file. Use -M to ignore all reads mapped to rRNA and mitochondrial transcripts; this will increase speed and performance. |
| | 35 | |
| | 36 | |
| | 37 | {{{ |
| | 38 | bsub cufflinks -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample1/accepted_hits.bam |
| | 39 | bsub cufflinks -M Mus_musculus.NCBIM37.62.noNT.rRNA.chrM.gtf sample2/accepted_hits.bam |
| | 40 | }}} |
| | 41 | |
| | 42 | 3. Merge the resulting assemblies |
| | 43 | |
| | 44 | |
| | 45 | {{{ |
| | 46 | Create a file called assemblies.txt, which lists the gtf files derived from cufflinks (above). This file should include: |
| | 47 | sample1/transcripts.gtf |
| | 48 | sample2/transcripts.gtf |
| | 49 | Run cuffmerge, a GTF merging script, which creates a merged annotation (merged_asm/merged.gtf) |
| | 50 | cuffmerge -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa assemblies.txt |
| | 51 | }}} |
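
With more than a few samples, assemblies.txt can be generated rather than typed by hand. The following is a minimal sketch: the function name `write_assembly_list` is hypothetical, and it assumes each sample directory contains a transcripts.gtf file, as in the cufflinks step.

```shell
# Hypothetical helper: write one transcripts.gtf path per line for the given sample dirs.
# Usage: write_assembly_list OUTPUT_FILE SAMPLE_DIR [SAMPLE_DIR ...]
write_assembly_list() {
    out=$1
    shift
    : > "$out"                      # create or truncate the list file
    for dir in "$@"; do
        gtf="$dir/transcripts.gtf"
        if [ -s "$gtf" ]; then      # keep only files that exist and are non-empty
            echo "$gtf" >> "$out"
        else
            echo "warning: $gtf is missing or empty" >&2
        fi
    done
}

# Example use:
#   write_assembly_list assemblies.txt sample1 sample2
```

Skipping missing or empty files with a warning makes a failed cufflinks job visible before cuffmerge is run.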
| | 52 | |
| | 53 | 4. Compare the merged assembly with known or annotated genes |
| | 54 | |
| | 55 | |
| | 56 | {{{ |
| | 57 | cuffcompare -s /nfs/genomes/mouse_gp_jul_07_no_random/mouse_all_no_random.fa -r /nfs/genomes/mouse_gp_jul_07/gtf/mm9_refseq.gtf merged_asm/merged.gtf |
| | 58 | }}} |
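
To pull out candidate novel transcripts after the comparison, one option is to filter the cuffcompare tracking file by class code. This is a sketch, not a tested part of our pipeline. It assumes the default output prefix (so the file is named cuffcmp.tracking), that class code "u" marks transfrags with no overlapping reference annotation, and that the class code sits in the fourth tab-separated column; check the column against your own file. The function name `novel_transfrags` is hypothetical.

```shell
# Hypothetical helper: list transfrag IDs whose class code is "u" (unknown, intergenic).
# Usage: novel_transfrags TRACKING_FILE [CLASS_CODE_COLUMN]
# Assumption: the class code is in column 4 unless another column number is given.
novel_transfrags() {
    tracking=$1
    col=${2:-4}
    awk -F '\t' -v c="$col" '$c == "u" { print $1 }' "$tracking"
}

# Example use:
#   novel_transfrags cuffcmp.tracking > novel_transfrag_ids.txt
```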
| | 59 | |
| | 60 | == New methods == |
| | 61 | |
| | 62 | We haven't yet tested: |
| | 63 | |
| | 64 | * [http://www.broadinstitute.org/software/scripture/ Scripture] |
| | 65 | * [http://www.broadinstitute.org/scientific-community/software/trinity Trinity] - includes inchworm together with other methods |
| | 66 | See the [http://www.the-dream-project.org/result/alternative-splicing DREAM “alternative splicing”] challenge for a description of the bake-off and its results. |
| | 67 | |
| | 68 | |
| | 69 | == Cluster sequences with TGICL == |
| | 70 | |
| | 71 | |
| | 72 | {{{ |
| | 73 | TGICL was designed for the assembly of longer transcript fragments like ESTs. |
| | 74 | It can still be useful for the multi-step assembly of large or heterogeneous transcript fragments. |
| | 75 | In these cases, short read assemblers can be used as a first step to generate longer contigs (of variable lengths) which can be further assembled with TGICL. |
| | 76 | }}} |
| | 77 | |
| | 78 | {{{ |
| | 79 | Sample command: |
| | 80 | bsub tgicl contig_cleaned.fa -l 40 -p 90 |
| | 81 | }}} |
| | 82 | |
| | 83 | == Cleaning the assembled sequence == |
| | 84 | |
| | 85 | |
| | 86 | {{{ |
| | 87 | Short reads should generally be cleaned of vector/linker/primer sequences before assembly. |
| | 88 | In some cases we may have pre-assembled contigs that still contain contamination. |
| | 89 | |
| | 90 | Sample command: In this example, contig.fa is the output file from the assembly step above |
| | 91 | bsub "seqclean contig.fa -v /nfs/genomes/UniVec/UniVec_Core -o contig_cleaned.fa" |
| | 92 | }}} |
| | 93 | |
| | 94 | |
| | 95 | == Other methods we've tried == |
| | 96 | |
| | 97 | * Velvet wasn't very successful, at least with short planarian reads. |