
Version 15 (modified by gbell, 3 years ago)

--

To learn about Whitehead Institute bioinformatics resources, browse this page, use your browser's Find command, or use the Search box at the top right of the page to search the questions and answers.

Frequently Asked Questions

  1. Where can I find sample blast+ commands?

  2. How can I align two sequences?

  3. How can I run BLAT locally?

  4. How can I get the promoter sequence of a gene?

  5. How can I make a list of items non-redundant?

  6. How can I access the Whitehead version of the Entrez Gene database?

  7. How can I find slides and materials from past Hot Topics talks?

  8. How can I create my own relational database?

  9. How can I download data/tracks from UCSC?

  10. How can I access BaRC Tools or know what tools are available?

  11. How can I submit a job to the LSF cluster?

  12. How can I find out what Perl modules or R packages are installed? Which version is currently installed on the server?

  13. How can I connect to tak?

  14. How can I get to my own or my lab's shared storage?

  15. Where can I find local BLAST databases?

  16. Where can I find genome sequences?

  17. Where can I find genomes formatted for bowtie, tophat, or blat?

  18. Where can I find known or predicted transcription factors that regulate a gene?

  19. Where can I find simple (one-liner) Unix/Perl commands?

  20. Where can I find samples of R code?

  21. Where can I find the local mirror of the UCSC genome browser?

  22. Where can I find the local mirror of Galaxy?

  23. Where can I find R Studio on tak?

  24. Where can I find IGV download?

  25. Which software should I use to cluster, create and display heatmaps?

  26. Which software should I use to do GO enrichment analysis?

  27. Which software should I use to display a gene network?

  28. How can I get desktop software provided by Whitehead?

  29. Which software should I use to do statistics?

  30. How can I search for Pfam (protein) profiles in my protein set using HMMs?

  31. How do I install an R package locally?

  32. I need to send/receive very large data files to/from a colleague outside of Whitehead. What is the best way to do this?

  33. Why do I get different BLAST results from WI and NCBI Blast?

  34. How do I run tophat/bowtie on the LSF with a gzip'd tar (*.tar.gz) file?

  35. How can I run AlphaFold 2.0 here at Whitehead?

  36. How can I output small p-values from R?


Answers to Frequently Asked Questions

  1. Where can I find sample blast+ commands?

  2. How can I align two sequences?

  3. How can I run BLAT locally?

  4. How can I get the promoter sequence of a gene?

    • Go to the UCSC Genome Bioinformatics genome browser.
    • Choose your desired genome and enter your desired gene (in the "position or search term" box).
    • If the gene has multiple transcripts, choose the one you want.
    • Paying attention to the direction of the gene (indicated by the intron hash marks), note the coordinate of the transcription start site (TSS).
    • Enter a range of coordinates before and/or after the TSS and click on "jump".
    • When you have the desired range in the browser, click on "DNA" on the top blue bar.
    • Check the "Reverse complement" box if your gene is on the negative strand.
    • Click on the "get DNA" button.
    • If you want to check your sequence relative to the TSS, map it with BLAT.
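The coordinate arithmetic behind these steps can be sketched in Python. This is only a minimal sketch, assuming 1-based inclusive coordinates as displayed in the browser; the function names, the example TSS position, and the default window sizes are hypothetical and are not part of any UCSC tool.

```python
def promoter_range(tss, strand, upstream=1000, downstream=100):
    """Return (start, end) of a promoter window around a TSS.

    Coordinates are 1-based and inclusive, with start <= end.
    'upstream' always means 5' of the TSS relative to the gene,
    so the window is mirrored for genes on the negative strand.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(1, start), end  # never run off the start of the chromosome


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence, as needed for negative-strand genes."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Hypothetical gene on the negative strand with its TSS at 5,000,000
    start, end = promoter_range(5_000_000, "-")
    print(f"range to enter in the browser: {start}-{end}")
    print(reverse_complement("ATGCCN"))
```

For a negative-strand gene, the "upstream" bases have larger coordinates than the TSS, which is why the window is flipped before the sequence is reverse-complemented.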

  5. How can I make a list of items non-redundant?

  6. How can I access the Whitehead version of the Entrez Gene database?

    • Whitehead BaRC designed a local copy of the Entrez Gene database using MySQL
    • You need a MySQL client to access the database, either a desktop tool like MySQL Workbench or a tak account.
    • The information you need:
      • Hostname = devo.wi.mit.edu
      • database = entrez_gene
      • username = barc_read_only
      • password = Ask BaRC about this
    • On tak, use the command
      • mysql -u barc_read_only -h devo.wi.mit.edu -D entrez_gene -p

  7. How can I find slides and materials from past Hot Topics talks?

    • See our Hot Topics page, with links to presentations and other materials.

  8. How can I create my own relational database?

  9. How can I download data/tracks from UCSC?

    • Go to UCSC Genome Bioinformatics
    • Click on "Downloads" on the bar on the left side
    • Choose the desired species and assembly, noting that coordinates only apply to the assembly they were generated with.
    • Data from most tracks are available by following the "Annotation database" link.
    • Every file is either the actual data in a tab-delimited text file (*.txt.gz) or a small file (*.sql) that provides a name for each column.
    • The data files can be opened in Excel, processed as text files, or used to create a table in your MySQL database.
    • Note from the dates of the annotation files that some are updated much more often than others.
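As an illustration of processing one of these data files as text, here is a minimal Python sketch. It assumes the column names have been copied by hand from the small column-definition file that accompanies each data file; the function names, column names, and rows below are made up for the example.

```python
import csv
import gzip
import io


def read_ucsc_table(handle, column_names):
    """Yield one dict per row of a tab-delimited table dump.

    'handle' is a text-mode file object; 'column_names' must be supplied
    by the caller because the data files have no header line.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        yield dict(zip(column_names, fields))


def open_table(path):
    """Open a gzip-compressed (*.txt.gz) data file as text."""
    return gzip.open(path, mode="rt", newline="")


if __name__ == "__main__":
    # Tiny in-memory stand-in for a downloaded file (made-up rows and columns)
    raw = gzip.compress(b"chr1\t100\t200\tgeneA\nchr2\t300\t400\tgeneB\n")
    columns = ["chrom", "start", "end", "name"]
    with gzip.open(io.BytesIO(raw), mode="rt", newline="") as handle:
        for row in read_ucsc_table(handle, columns):
            print(row["name"], int(row["end"]) - int(row["start"]))
```

All values come back as strings, so numeric columns such as coordinates need an explicit int() conversion.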

  10. How can I access BaRC Tools or know what tools are available?

  11. How can I submit a job to the LSF cluster?

  12. How can I find out what Perl modules or R packages are installed? Which version is currently installed on the server?

  13. How can I connect to tak?

    • To connect to tak, you need a tak account and some kind of secure shell (ssh) with X Windows (to get the graphics):
    • With either system, double click on the icon to get the "command prompt", the window in which you can type commands.
    • Windows only: After opening Cygwin, start X Windows by typing "startx". A new terminal window will open, and you should use that one.
    • From the command prompt, connect to tak (or another Unix/Linux computer) with a command like
       ssh username@tak.wi.mit.edu -Y
       or
       ssh username@tak.wi.mit.edu -X
       where username is the name of your tak account.  
       You'll be prompted for your password.
      
    • In addition, see the tutorials created by IT called Install Cygwin or TakPack. IT recommends using TakPack if possible, as it provides both X11 and an ssh client and is a "lighter install".

  14. How can I get to my own or my lab's shared storage?

    • On a Mac or Windows computer, most shared storage areas can be accessed via wi-files1 or wi-files2, although high-throughput sequencing data is accessed via wi-htdata.
    • On a Mac computer, get to a server like wi-files1/BaRC_Public (/nfs/BaRC_Public) by connecting to
      cifs://wi-files1/BaRC_Public
      
    • On a Windows computer, get to a server like wi-files1/BaRC_Public (/nfs/BaRC_Public) by connecting to
      \\wi-files1\BaRC_Public
      
    • See lab share paths to get to your lab storage area.

  15. Where can I find local BLAST databases?

    • BLAST-formatted databases can be found in /nfs/seq/Data on tak.

  16. Where can I find genome sequences?

    • Genome sequences can be found in /nfs/genomes on tak.

  17. Where can I find genomes formatted for bowtie, tophat, or blat?

    • Within many directories on /nfs/genomes you can find these additional files.

  18. Where can I find known or predicted transcription factors that regulate a gene?

  19. Where can I find simple (one-liner) Unix/Perl commands?

  20. Where can I find samples of R code?

    • Sample R code is available in /nfs/BaRC_Public/BaRC_code.

  21. Where can I find the local mirror of the UCSC genome browser?

  22. Where can I find the local mirror of Galaxy?

  23. Where can I find R Studio on tak?

  24. Where can I find IGV download?

  25. Which software should I use to cluster, create and display heatmaps?

  26. Which software should I use to do GO enrichment analysis?

  27. Which software should I use to display a gene network?

  28. How can I get desktop software provided by Whitehead?

  29. Which software should I use to do statistics?

  30. How can I search for Pfam (protein) profiles in my protein set using HMMs?
    • A local copy of all Pfam HMMs can be found at /nfs/seq/pfam_db/Pfam-A.hmm
    • Pfam profiles can be easily searched with the HMMER suite of tools
      • If you want to annotate a set of proteins with only specific profiles, you can extract one profile at a time with hmmfetch
        • ex: hmmfetch /nfs/seq/pfam_db/Pfam-A.hmm PF01731.15 > PF01731.15.hmm
      • Use hmmsearch to search your protein set (as a multiple-sequence fasta file) with one to all profiles
        • ex1 (all profiles): hmmsearch /nfs/seq/pfam_db/Pfam-A.hmm My_proteins.fa > My_proteins.Pfam_search_out.txt
        • ex2 (selected profile): hmmsearch PF01731.15.hmm My_proteins.fa > My_proteins.PF01731.15_search_out.txt
      • For more details about HMMER, consult the HMMER User's Guide (ftp://selab.janelia.org/pub/software/hmmer3/3.0/Userguide.pdf).
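The examples above save the human-readable report. hmmsearch can also write a one-line-per-hit table with its --tblout option, which is easier to post-process. The following Python sketch parses such a table; the Hit class, the function name, and the example line are hypothetical, and the sample values are made up.

```python
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    target: str      # name of the matching protein sequence
    query: str       # name of the profile
    accession: str   # accession of the profile, e.g. PF01731.15
    evalue: float    # full-sequence E-value
    score: float     # full-sequence bit score


def parse_tblout(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse the per-sequence table written by `hmmsearch --tblout FILE`."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # The first 18 columns are whitespace-separated; everything after
        # them is the free-text target description, which may contain spaces.
        f = line.split(None, 18)
        yield Hit(target=f[0], query=f[2], accession=f[3],
                  evalue=float(f[4]), score=float(f[5]))


if __name__ == "__main__":
    # Made-up line in tblout layout; a real table would come from e.g.
    #   hmmsearch --tblout My_proteins.tbl PF01731.15.hmm My_proteins.fa
    example = [
        "# target name  accession  query name  accession  E-value  score ...",
        "prot1 - Some_profile PF01731.15 1.2e-30 100.5 0.1 "
        "2.0e-30 99.8 0.1 1.0 1 0 0 1 1 1 1 made-up description",
    ]
    for hit in parse_tblout(example):
        if hit.evalue < 1e-5:  # arbitrary cutoff for the example
            print(hit.target, hit.accession, hit.evalue)
```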

  31. How do I install an R package locally?
    #Method 1:
    #Download the source of the package you want to install (*.tar.gz)
    #In the R command line:
    #Location of the source AND where to install the package
    R_libraries_path = "/home/userName/R_libs"
    #Go to where the .tar.gz package source is and install it
    setwd(R_libraries_path)
    install.packages("hthgu133ahsentrezgcdf_12.0.0.tar.gz", lib=R_libraries_path, repos=NULL)
    
    #Method 2:
    #In a bash shell, set R_LIBS to a directory called R, or whatever you'd like
    export R_LIBS="$HOME/R"
    #In the R command line (BiocManager replaced biocLite() as of Bioconductor 3.8):
    install.packages("BiocManager")
    BiocManager::install("pd.mogene.1.0.st.v1")
    #On older R installations that still use biocLite():
    #source("http://www.bioconductor.org/biocLite.R")
    #biocLite("pd.mogene.1.0.st.v1")
    
  32. I need to send/receive very large data files to/from a colleague outside of Whitehead. What is the best way to do this?
  33. Why do I get different BLAST results from http://tak.wi.mit.edu/blast/ and NCBI Blast?
    • WI and NCBI pages have very different defaults. As a result, the same query can produce a hit at WI with an e-value of 1e-12, while at NCBI the alignment is completely different and leads to an e-value of 2e-98.
    • Word sizes are different, as are match/mismatch scores (1/-3 at WI, 2/-3 at NCBI).
    • Blast at WI filters (hard masks) for low complexity by default, whereas at NCBI it doesn't, instead using the "L;m;" filter string, which filters for low-complexity for the lookup table but not for extension. This can have a huge effect.
    • Even at NCBI, choosing (a) human RefSeq sequences or (b) all RefSeq sequences and then filtering for human only produces somewhat different results. Also, the size of the database has a big effect on the e-values. For one example query that produces the same alignment in three different databases, the e-values are very different:
      • human RefSeq sequences => 0.008
      • all RefSeq sequences => 0.47
      • nt => 2.9
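The database-size effect follows from how the e-value is defined: for a fixed alignment score it is proportional to the size of the search space. The Python sketch below illustrates that scaling. It is an approximation that ignores the length corrections BLAST applies, and the database sizes in the last call are hypothetical.

```python
def rescale_evalue(evalue, old_db_size, new_db_size):
    """Estimate the e-value of the same alignment in a database of another size.

    Based on E = K * m * n * exp(-lambda * S): for a fixed query length (m)
    and score (S), E is proportional to the database size n. The corrections
    BLAST applies to m and n are ignored, so this is only an estimate.
    """
    if old_db_size <= 0 or new_db_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * new_db_size / old_db_size


if __name__ == "__main__":
    # E-values quoted above for one alignment in three databases
    human_refseq, all_refseq, nt = 0.008, 0.47, 2.9
    # Database size ratios implied by those e-values, relative to human RefSeq
    print(all_refseq / human_refseq)
    print(nt / human_refseq)
    # Hypothetical sizes: a 10x larger database gives a 10x larger e-value
    print(rescale_evalue(1e-5, old_db_size=1e8, new_db_size=1e9))
```

This is why an alignment that looks significant against a small, species-specific database can fall below a typical e-value cutoff when the same search is run against nt.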
  34. How do I run tophat/bowtie on the LSF with a gzip'd tar (*.tar.gz) file?
    • Use process substitution, so that tophat reads each file directly from the output of tar: bsub bash -c "tophat ... <(tar xvzfO ...) <(tar xvzfO ...)"
      • eg. bsub bash -c "tophat -p 10 -g 1 -o mapped_data_SRR905147_unique -N 2 -I 10000 --segment-length 25 --segment-mismatches 2 hg19 <(tar xvzfO s_2_1_sequence.txt.tar.gz ACTTGA-s_2_1_sequence.txt) <(tar xvzfO CAGATC-s_2_1_sequence.txt.tar.gz CAGATC-s_2_1_sequence.txt)"
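The key step is that tar's O option writes the named file to standard output instead of unpacking it to disk. For readers who want the same streaming behavior inside a script, here is a minimal Python sketch using the standard tarfile module; the function name and the archive contents are made up for the example.

```python
import io
import shutil
import sys
import tarfile


def stream_member(archive, member, out):
    """Copy one file from a .tar.gz archive to a binary stream without
    unpacking it to disk (similar to `tar xzfO archive member`).

    'archive' may be a path (str) or a seekable binary file object.
    """
    kwargs = {"name": archive} if isinstance(archive, str) else {"fileobj": archive}
    with tarfile.open(mode="r:gz", **kwargs) as tar:
        extracted = tar.extractfile(member)  # KeyError if the member is absent
        if extracted is None:
            raise ValueError(f"{member} is not a regular file")
        shutil.copyfileobj(extracted, out)


if __name__ == "__main__":
    # Build a small in-memory archive holding one made-up FASTQ record
    data = b"@read1\nACGT\n+\nIIII\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo("reads.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    stream_member(buf, "reads.txt", sys.stdout.buffer)
```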
  35. How can I run AlphaFold 2.0 here at Whitehead?
    • You'll need access to a system with a GPU, or the new GPU queue. IT can help you choose and obtain access to GPU systems if required.
    • For a working command to be executed on fry.wi.mit.edu, which then sends the AlphaFold command to the GPU node on the slurm cluster, see /nfs/BaRC_Public/BaRC_code/shell/run_AlphaFold/Commands.sh
    • Start by copying RunAlphaFold_slurm.sh to your project directory.
    • The main command looks like
      sbatch -J AF_1 --export=ALL,FASTA_NAME=Sample_protein_1.fa,USERNAME=myUsername,FASTA_PATH=proteins,AF2_WORK_DIR=/nfs/BaRC_Public/BaRC_code/shell/run_AlphaFold ./RunAlphaFold_slurm.sh
      
      where the inputs are
      • AF2_WORK_DIR => project working directory
      • FASTA_PATH => directory within AF2_WORK_DIR with input protein sequence
      • FASTA_NAME => name of input protein sequence file within $AF2_WORK_DIR/$FASTA_PATH
      • USERNAME => Username for job submission and email
    • More information, including an explanation of the output files, is here: https://github.com/deepmind/alphafold
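When submitting many proteins, it can help to assemble the sbatch command programmatically. The Python sketch below rebuilds the command shown above from its four inputs; the function names are hypothetical, and the expected input location is inferred from the description of FASTA_NAME rather than taken from RunAlphaFold_slurm.sh itself.

```python
import shlex
from pathlib import Path


def build_sbatch_command(job_name, work_dir, fasta_path, fasta_name, username,
                         script="./RunAlphaFold_slurm.sh"):
    """Assemble the sbatch command line from the four inputs.

    Values must not contain commas, because sbatch uses commas to
    separate the variables listed in --export.
    """
    exports = {
        "FASTA_NAME": fasta_name,
        "USERNAME": username,
        "FASTA_PATH": fasta_path,
        "AF2_WORK_DIR": work_dir,
    }
    export_arg = "ALL," + ",".join(f"{key}={value}" for key, value in exports.items())
    return ["sbatch", "-J", job_name, f"--export={export_arg}", script]


def input_fasta(work_dir, fasta_path, fasta_name):
    """Where the input sequence is expected: $AF2_WORK_DIR/$FASTA_PATH/$FASTA_NAME."""
    return Path(work_dir) / fasta_path / fasta_name


if __name__ == "__main__":
    cmd = build_sbatch_command(
        job_name="AF_1",
        work_dir="/nfs/BaRC_Public/BaRC_code/shell/run_AlphaFold",
        fasta_path="proteins",
        fasta_name="Sample_protein_1.fa",
        username="myUsername",
    )
    print(shlex.join(cmd))  # paste into a shell, or pass cmd to subprocess.run
```

Checking that input_fasta(...) exists before submitting avoids waiting in the GPU queue for a job that fails immediately.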
  36. How can I output small p-values from R?
    • R often prints small p-values as "p-value < 2.2e-16" whereas the output from the statistical test has actually calculated a much more accurate p-value.
    • To access the exact p-value, explicitly call the value from the test output. For example
      a = jitter(rep(1, 20))
      b = jitter(rep(3, 20))
      t.test(a,b)  # p-value < 2.2e-16
      t.test(a,b)$p.value  # 9.0e-40 (or another small value, dependent on the output from jitter())
      
    • Other statistical tests may save the p-value with a different name. Try names(OBJECT) to see your choices and figure out how to access the exact value.