== Extracting genome subsequences == ==== Considerations for all methods ==== * Beware of OBOBs (off-by-one bugs), as command-line and web tools may not use the same coordinate numbering scheme. * Check that your numbering scheme is consistent with your application. * Check the strandedness of extracted sequence (if it matters). === Web method === * Fine for a few regions (and for non command-line users) * Go to [[http://genome.ucsc.edu/|UCSC Genome Bioinformatics]], select a genome and a region * Click on DNA on top blue bar * Check "Reverse complement" box if negative strand is desired * Click on "get DNA" button === Command-line method (one region) === * Go to /nfs/genomes and select (or download) desired genome * Check that genome file(s) are in nib format. If not: * Run ''faToNib -softMask in.fa out.nib'' [where the softMask maintains lowercase (masked) genome regions] * To process a directory of fasta files: ''/nfs/BaRC_Public/BaRC_code/Perl/make_nib_files/make_nib_files.pl DIRECTORY'' * To do actual subsequence extraction, ''nibFrag [options] file.nib start end strand out.fa'' * Despite nibFrag usage statement, negative strand can be requested by "-" or by "m" === Command-line method (many regions) === * This method is based on a perl wrapper for nibFrag, such as * ''/nfs/BaRC_Public/BaRC_code/Perl/extract_genomic_regions/extract_genomic_regions.pl pathToNibFiles coordFile outFile'' * where the coordFile is of a format like **RegionIDchr1:1-12345** * and the nib files are under /nfs/genomes in a genome's 'blat directory' === UCSC Table Browser (many regions) === * Go to [[http://genome.ucsc.edu/cgi-bin/hgTables|UCSC Bioinformatics Table Browser]] * Select the desired assembly * Select regions based on browser track === Methods for extraction from short sequences === * EMBOSS's ''extractseq'' [see ''extractseq -h'' for details] -- but very slow for long sequences * Perl's ''substr'' command * BioPerl's substring command (such as ''$seqobj->subseq(1,100)'') == Extracting genome subsequences == ** This method replaces BaRC's mafExtract.pl (and related programs), which were prone to occasional bugs. ** * **Using UCSC's mafFrag command:** * .hg.conf is configured to access UCSC databases on membrane.wi.mit.edu * Copy /nfs/BaRC/apps/UCSC/hg.conf to your home directory (or ask a member of BaRC to send it to you) and rename it to .hg.conf * ''cp /nfs/BaRC/apps/UCSC/hg.conf ~/.hg.conf'' * Chmod .hg.conf so only you can read it * ''chmod 600 ~/.hg.conf'' * USAGE: ''mafFrag database mafTrack chrom start end strand out.maf'' * options: -outName=XXX Use XXX instead of database.chrom for the name * Examples: * ''mafFrag hg19 multiz46way chr1 69000 70000 + example1.maf'' * ''mafFrag mm9 multiz30way chr1 4334000 4335000 + example2.maf'' * ''mafFrag ce6 multiz6way chrI 4000 5000 + example3.maf''