Extracting genome subsequences

Considerations for all methods

  • Beware of OBOBs (off-by-one bugs), as command-line and web tools may not use the same coordinate numbering scheme.
  • Check that your numbering scheme is consistent with your application.
  • Check the strandedness of extracted sequence (if it matters).

Web method

  • Fine for a few regions (and for non command-line users)
  • Go to UCSC Genome Bioinformatics, select a genome and a region
  • Click on DNA on top blue bar
  • Check "Reverse complement" box if negative strand is desired
  • Click on "get DNA" button

Command-line method (one region)

  • Go to /nfs/genomes and select (or download) desired genome
  • Check that genome file(s) are in nib format. If not:
    • Run faToNib -softMask in.fa out.nib [where the softMask maintains lowercase (masked) genome regions]
    • To process a directory of fasta files: /nfs/BaRC_Public/BaRC_code/Perl/make_nib_files/ DIRECTORY
  • To do actual subsequence extraction, nibFrag [options] file.nib start end strand out.fa
  • Despite nibFrag usage statement, negative strand can be requested by "-" or by "m"

Command-line method (many regions)

  • This method is based on a perl wrapper for nibFrag, such as
  • /nfs/BaRC_Public/BaRC_code/Perl/extract_genomic_regions/ pathToNibFiles coordFile outFile
  • where the coordFile is of a format like RegionID<tab>chr1:1-12345
  • and the nib files are under /nfs/genomes in a genome's 'blat directory'

UCSC Table Browser (many regions)

Methods for extraction from short sequences

  • EMBOSS's extractseq [see extractseq -h for details] -- but very slow for long sequences
  • Perl's substr command
  • BioPerl's substring command (such as $seqobj->subseq(1,100))

Extracting genome subsequences

This method replaces BaRC's (and related programs), which were prone to occasional bugs.

  • Using UCSC's mafFrag command:
  • .hg.conf is configured to access UCSC databases on
  • Copy /nfs/BaRC/apps/UCSC/hg.conf to your home directory (or ask a member of BaRC to send it to you) and rename it to .hg.conf
    • cp /nfs/BaRC/apps/UCSC/hg.conf ~/.hg.conf
  • Chmod .hg.conf so only you can read it
    • chmod 600 ~/.hg.conf
  • USAGE: mafFrag database mafTrack chrom start end strand out.maf
    • options: -outName=XXX Use XXX instead of database.chrom for the name
  • Examples:
    • mafFrag hg19 multiz46way chr1 69000 70000 + example1.maf
    • mafFrag mm9 multiz30way chr1 4334000 4335000 + example2.maf
    • mafFrag ce6 multiz6way chrI 4000 5000 + example3.maf