Extracting genome subsequences
Considerations for all methods
- Beware of OBOBs (off-by-one bugs), as command-line and web tools may not use the same coordinate numbering scheme.
- Check that your numbering scheme is consistent with your application.
- Check the strandedness of extracted sequence (if it matters).
Web method
- Fine for a few regions (and for non command-line users)
- Go to UCSC Genome Bioinformatics, select a genome and a region
- Click on DNA on top blue bar
- Check "Reverse complement" box if negative strand is desired
- Click on "get DNA" button
Command-line method (one region)
- Go to /nfs/genomes and select (or download) desired genome
- Check that genome file(s) are in nib format. If not:
- Run faToNib -softMask in.fa out.nib [where the softMask maintains lowercase (masked) genome regions]
- To process a directory of fasta files: /nfs/BaRC_Public/BaRC_code/Perl/make_nib_files/make_nib_files.pl DIRECTORY
- To do actual subsequence extraction, nibFrag [options] file.nib start end strand out.fa
- Despite nibFrag usage statement, negative strand can be requested by "-" or by "m"
Command-line method (many regions)
- This method is based on a perl wrapper for nibFrag, such as
- /nfs/BaRC_Public/BaRC_code/Perl/extract_genomic_regions/extract_genomic_regions.pl pathToNibFiles coordFile outFile
- where the coordFile is of a format like RegionID<tab>chr1:1-12345
- and the nib files are under /nfs/genomes in a genome's 'blat directory'
UCSC Table Browser (many regions)
- Go to UCSC Bioinformatics Table Browser
- Select the desired assembly
- Select regions based on browser track
Methods for extraction from short sequences
- EMBOSS's extractseq [see extractseq -h for details] -- but very slow for long sequences
- Perl's substr command
- BioPerl's substring command (such as $seqobj->subseq(1,100))
Extracting genome subsequences
This method replaces BaRC's mafExtract.pl (and related programs), which were prone to occasional bugs.
- Using UCSC's mafFrag command:
- .hg.conf is configured to access UCSC databases on membrane.wi.mit.edu
- Copy /nfs/BaRC/apps/UCSC/hg.conf to your home directory (or ask a member of BaRC to send it to you) and rename it to .hg.conf
- cp /nfs/BaRC/apps/UCSC/hg.conf ~/.hg.conf
- Chmod .hg.conf so only you can read it
- chmod 600 ~/.hg.conf
- USAGE: mafFrag database mafTrack chrom start end strand out.maf
- options: -outName=XXX Use XXX instead of database.chrom for the name
- Examples:
- mafFrag hg19 multiz46way chr1 69000 70000 + example1.maf
- mafFrag mm9 multiz30way chr1 4334000 4335000 + example2.maf
- mafFrag ce6 multiz6way chrI 4000 5000 + example3.maf
Note:
See TracWiki
for help on using the wiki.