FAQ – BaRC Wiki

To learn about Whitehead Institute bioinformatics resources, browse the page, use the Find command in your browser or use the Search box at the top right of the page to search the questions and answers.

Frequently Asked Questions

Where can I find sample blast+ commands?
How can I align two sequences?
How can I run BLAT locally?
How can I get the promoter sequence of a gene?
How can I make a list of items non-redundant?
How can I access the Whitehead version of the Entrez Gene database?
How can I find slides and materials from past Hot Topics talks?
How can I create my own relational database?
How can I download data/tracks from UCSC?
How can I access BaRC Tools or know what tools are available?
How can I submit a job to the slurm cluster?
How can I find out what Perl modules or R packages are installed? Which version is currently installed in the server?
How can I connect to fry?
How can I get to my or my lab shared storage?
Where can I find local BLAST databases?
Where can I find genome sequences?
Where can I find genomes formatted for bowtie, tophat, or blat?
Where can I find known or predicted transcription factors that regulate a gene?
Where can I find simple (one-liner) Unix/Perl commands?
Where can I find samples of R code?
Where can I find the local mirror of the UCSC genome browser?
Where can I find the local mirror of Galaxy?
Where can I find a local R Studio server?
Where can I find IGV download?
Which software should I use to cluster, create and display heatmaps?
Which software should I use to do GO enrichment analysis?
Which software should I use to display a gene network?
How can I get desktop software provided by Whitehead?
Which software should I use to do statistics?
How can I search for Pfam (protein) profiles in my protein set using HMMs?
How do I install an R package locally?
I need to send/receive very large data files to/from a colleague outside of Whitehead. What is the best way to do this?
Why do I get different BLAST results from WI and NCBI Blast?
How do I run tophat/bowtie on the LSF with a gzip'd tar (*.tar.gz) file?
How can I run AlphaFold 2.0-2.3 here at Whitehead?
How can I run AlphaFold 3 here at Whitehead?
How can I output small p-values from R?
How can I load the earlier version of Seurat in R?

Answers to Frequently Asked Questions

Where can I find sample blast+ commands?
- See BLAST+ tips
How can I align two sequences?
- Use an EMBOSS program (http://bioinfo.wi.mit.edu/bio/tools/emboss/) for an optimal alignment
  - water for a Smith-Waterman optimal local alignment
  - needle for a Needleman-Wunsch optimal global alignment
  - stretcher for a Needleman-Wunsch optimal global alignment (optimized for longer sequences)
- Use blast2seq https://blast.ncbi.nlm.nih.gov/Blast.cgi?BLAST_SPEC=blast2seq&LINK_LOC=align2seq&PAGE_TYPE=BlastSearch for a quick local alignment
How can I run BLAT locally?
- See our Using BLAT on fry page.
How can I get the promoter sequence of a gene?
- Go to the UCSC Genome Bioinformatics genome browser.
- Choose your desired genome and enter your desired gene (in the "position or search term" box).
- If the gene has multiple transcripts, choose the one you want.
- Paying attention to the direction of the gene (indicated by the intron hash marks), note the coordinate of the transcription start site (TSS).
- Enter a range of coordinates before and/or after the TSS and click on "jump".
- When you have the desired range in the browser, click on "DNA" on the top blue bar.
- Check the "Reverse complement" box if your gene is on the negative strand.
- Click on the "get DNA" button.
- If you want to check your sequence relative to the TSS, map it with BLAT.
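- The coordinate arithmetic in the steps above is easy to get backwards for genes on the negative strand. A minimal sketch, assuming a helper named promoter_range (hypothetical, not a BaRC tool) and made-up example coordinates, that prints a range you can paste into the browser's position box:
```shell
# Print a UCSC-style position (chrom:start-end) for a promoter window.
# Arguments: chromosome, TSS coordinate, strand (+ or -),
#            bases upstream of the TSS, bases downstream of the TSS.
promoter_range() {
  chrom=$1; tss=$2; strand=$3; up=$4; down=$5
  if [ "$strand" = "+" ]; then
    # Plus strand: upstream means smaller coordinates.
    start=$((tss - up)); end=$((tss + down))
  else
    # Minus strand: upstream means larger coordinates.
    start=$((tss - down)); end=$((tss + up))
  fi
  # Coordinates cannot be smaller than 1.
  if [ "$start" -lt 1 ]; then start=1; fi
  echo "${chrom}:${start}-${end}"
}

# Hypothetical minus-strand gene with its TSS at position 1000000 on chr1;
# request 1000 bp upstream and 200 bp downstream of the TSS.
promoter_range chr1 1000000 - 1000 200
```
- For a negative-strand gene you would still check the "Reverse complement" box when getting the DNA, as described above.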
How can I make a list of items non-redundant?
- See our Redundant List Analysis page, which also counts how many times each item appears in your list.
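- If you prefer the command line, standard Unix tools can do the same kind of job. A minimal sketch, assuming a one-item-per-line text file (the file names and gene symbols here are made up for illustration):
```shell
# Hypothetical input: one item per line, with repeats.
printf '%s\n' TP53 MYC TP53 GAPDH MYC TP53 > my_list.txt

# Non-redundant list (each item once, sorted alphabetically).
sort -u my_list.txt > my_list.nr.txt

# Count how many times each item appears, most frequent first.
sort my_list.txt | uniq -c | sort -k1,1nr > my_list.counts.txt
cat my_list.counts.txt
```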
How can I access the Whitehead version of the Entrez Gene database?
- Whitehead BaRC designed a local copy of the Entrez Gene database using MySQL
- You need a MySQL client to access the database, either a desktop tool like MySQL Workbench or a fry account.
- The information you need:
  - Hostname = devo.wi.mit.edu
  - database = entrez_gene
  - username = barc_read_only
  - password = Ask BaRC about this
- On fry, use the command
  - mysql -u barc_read_only -h devo.wi.mit.edu -D entrez_gene -p
How can I find slides and materials from past Hot Topics talks?
- See our Hot Topics page, with links to presentations and other materials.
How can I create my own relational database?
- You can either create a MySQL database on your own computer or use Whitehead's MySQL server (devo), which would generally be more robust, if that's needed.
- If you'd like your own installation, download MySQL and install it.
- If you'd like to use devo, email callcenter@wi.mit.edu and request a database on devo. Once the IT group creates the database, you will be free to add tables and data.
- Regardless of the system, see the MySQL Reference Manual and our past BaRC presentations about MySQL:
  - Relational Databases For Biologists
  - Querying Biological Databases with SQL
How can I download data/tracks from UCSC?
- Go to UCSC Genome Bioinformatics
- Click on "Downloads" on the bar on the left side
- Choose the desired species and assembly, noting that coordinates only apply to the assembly they were generated with.
- Data from most tracks are available by following the "Annotation database" link.
- Every file is either the actual data in a tab-delimited text file (*.txt.gz) or a small file that provides a name for each column.
- The data files can be opened in Excel, processed as text files, or used to create a table in your MySQL database.
- Note from the dates of the annotation files that some are updated much more often than others.
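- The compressed tab-delimited files can be processed as text without uncompressing them on disk. A minimal sketch, using a tiny made-up stand-in file (example_table.txt.gz and its columns are hypothetical; for a real table, check the accompanying column-name file to see what each column holds):
```shell
# Stand-in for a downloaded table dump: a tiny tab-delimited file with
# made-up rows, compressed the same way as the UCSC *.txt.gz files.
printf 'chr1\t100\t200\tgeneA\nchr2\t300\t400\tgeneB\nchr1\t500\t600\tgeneC\n' \
  | gzip -c > example_table.txt.gz

# Peek at the first rows without uncompressing the file on disk.
gzip -dc example_table.txt.gz | head -n 2

# Keep selected columns (here 1, 2 and 4) for rows on chr1.
gzip -dc example_table.txt.gz \
  | awk -F '\t' -v OFS='\t' '$1 == "chr1" {print $1, $2, $4}' \
  > example_table.chr1.txt
```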
How can I access BaRC Tools or know what tools are available?
- BaRC Tools can be found on the BaRC Tools page. Available tools are listed on the Summary page.
How can I submit a job to the slurm cluster?
- Usually you just need to wrap your usual command with syntax like

sbatch --partition=20 --job-name=MY_JOB --mem=32G --wrap "MY COMMAND"

where

'sbatch' is the main command
'partition' indicates that you want to use Ubuntu 20 servers (with the most up-to-date software)
'job-name' is any short one-word name so you can more easily monitor its progress
'mem' is a required parameter requesting a specific amount of memory, and
'wrap' means that 'sbatch will wrap the specified command string in a simple "sh" shell script, and submit that script to the slurm controller'.

See IT's Using the slurm cluster for sample commands, changes in the environment (compared to the older LSF cluster), and links to more documentation.
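
For longer or multi-step jobs, the same options can be placed in a batch script instead of using --wrap. This is a minimal sketch, assuming a script name (my_job.sh) and a placeholder command chosen only for illustration; the option values are the ones from the --wrap example above, and IT's documentation remains the reference for what the cluster accepts.

```shell
# Write a small batch script; the #SBATCH lines carry the same options
# that were given on the command line in the --wrap example.
cat > my_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=20
#SBATCH --job-name=MY_JOB
#SBATCH --mem=32G
# Replace the next line with your own command(s).
echo "Job started"
EOF

# Submit it if sbatch is available; otherwise just say what would be run.
if command -v sbatch >/dev/null 2>&1; then
  sbatch my_job.sh
else
  echo "sbatch not found; would run: sbatch my_job.sh"
fi
```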

How can I find out what Perl modules or R packages are installed? Which version is currently installed in the server?
- The home page of Trac, our Tak tracking system, has links to lists of installed packaged software, installed Perl modules, installed Python modules, and installed R modules.
How can I connect to fry?
- To connect to fry, you need a fry account and some kind of secure shell (ssh) with X Windows (to get the graphics):
  - On a Macintosh, use x11 or Terminal
  - On a Windows computer, we recommend Cygwin/X. You can also use the Whitehead IT TakPack installer.
- With either system, double click on the icon to get the "command prompt", the window in which you can type commands.
- Windows only: After opening Cygwin, start X Windows by typing "startx". A new terminal window will open, and you should use that one.
- From the command prompt, connect to fry (or another Unix/Linux computer) with a command like
```
 ssh username@fry.wi.mit.edu -Y
 or
 ssh username@fry.wi.mit.edu -X
 where username is the name of your fry account.  
 You'll be prompted for your password.
```
- In addition, see the tutorials created by IT called Install Cygwin or TakPack. IT recommends using TakPack if possible, as it provides both X11 and an ssh client and is a "lighter install".
How can I get to my or my lab shared storage?
- On a Mac or Windows computer, most shared storage areas can be accessed via wi-files1 or wi-files2, although high-throughput sequencing data is accessed via wi-htdata.
- On a Mac computer, get to a server like wi-files1/BaRC_Public (/nfs/BaRC_Public) by connecting to
```
cifs://wi-files1/BaRC_Public
```
- On a Windows computer, get to a server like wi-files1/BaRC_Public (/nfs/BaRC_Public) by connecting to
```
\\wi-files1\BaRC_Public
```
- See lab share paths to get to your lab storage area.
Where can I find local BLAST databases?
- BLAST-formatted databases can be found in /nfs/seq/Data on fry.
Where can I find genome sequences?
- Genome sequences can be found in /nfs/genomes on fry.
Where can I find genomes formatted for bowtie, tophat, or blat?
- Within many directories on /nfs/genomes you can find these additional files.
Where can I find known or predicted transcription factors that regulate a gene?
- We do not have access to the BIOBASE Knowledge Library. However, an (older) command-line version is available. See BaRC SOPs for more info.
- GeneGO (Login Required) can be used as well to find known TFs
Where can I find simple (one-liner) Unix/Perl commands?
- There is a helpful list of Unix and Perl commands at http://bioinfo.wi.mit.edu/bio/bioinfo/scripts/.
Where can I find samples of R code?
- Sample R code is available in /nfs/BaRC_Public/BaRC_code.
Where can I find the local mirror of the UCSC genome browser?
- http://ucsc.wi.mit.edu/
- To add tracks via files, copy the files to /nfs/solexa_ucsc and then use their URLs under http://weblinks.wi.mit.edu/solexa_ucsc/ to submit the tracks to the browser.
Where can I find the local mirror of Galaxy?
- It used to be at https://galaxy.wi.mit.edu/ but we no longer provide Galaxy.
Where can I find a local R Studio server?
- https://rstudio.wi.mit.edu
Where can I find IGV download?
- http://www.broadinstitute.org/software/igv/log-in
Which software should I use to cluster, create and display heatmaps?
- Cluster 3.0 http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
- Java Treeview http://jtreeview.sourceforge.net/
Which software should I use to do GO enrichment analysis?
- DAVID http://david.abcc.ncifcrf.gov/ - our favorite tool for entering a list of genes of interest
- GSEA http://www.broadinstitute.org/gsea/index.jsp - a Java application that can take a ranked list of all your genes as input; our favorite tool for including all genes
- BIOBASE https://portal.biobase-international.com/cgi-bin/portal/login.cgi
- BiNGO (within Cytoscape http://www.cytoscape.org/)
- GoMiner http://discover.nci.nih.gov/gominer
- GOstat http://gostat.wehi.edu.au
- GeneOntology.org has a more complete list.
- Also see the Hot Topics talk "Gene list enrichment analysis" for more information.
Which software should I use to display a gene network?
- Cytoscape http://www.cytoscape.org/ is our favorite tool for this.
How can I get desktop software provided by Whitehead?
- Software is available through the Whitehead Software database. To see a list of desktop software, see http://it.wi.mit.edu/software/get-software.
Which software should I use to do statistics?
- GraphPad Prism (in the Whitehead software database) -- has an easy-to-use GUI and excellent practical documentation
- MATLAB (in the Whitehead software database, https://software.wi.mit.edu/)
- R (free download from http://www.r-project.org/)
- Also see the BaRC Hot Topics on An Introduction to GraphPad Prism - statistics and graphing software
How can I search for Pfam (protein) profiles in my protein set using HMMs?
- A local copy of all Pfam HMMs can be found at /nfs/seq/pfam_db/Pfam-A.hmm
- Pfam profiles can be easily searched with the HMMER suite of tools
  - If you want to annotate a set of proteins with only specific profiles, you can extract one profile at a time with hmmfetch
    - ex: hmmfetch /nfs/seq/pfam_db/Pfam-A.hmm PF01731.15 > PF01731.15.hmm
  - Use hmmsearch to search your protein set (as a multiple-sequence fasta file) with one to all profiles
    - ex1 (all profiles): hmmsearch /nfs/seq/pfam_db/Pfam-A.hmm My_proteins.fa > My_proteins.Pfam_search_out.txt
    - ex2 (selected profile): hmmsearch PF01731.15.hmm My_proteins.fa > My_proteins.PF01731.15_search_out.txt
  - For more details about HMMER, consult the HMMER User's Guide (ftp://selab.janelia.org/pub/software/hmmer3/3.0/Userguide.pdf).
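- The default hmmsearch report is meant for reading. For filtering hits, the tabular output written with --tblout is usually easier to parse. A minimal sketch, assuming the output file names and the helper tblout_hits (both hypothetical); --tblout and -o are standard hmmsearch options:
```shell
# Run the search only if hmmsearch and the input file are present.
# --tblout writes one line per protein/profile hit; -o keeps the full report.
if command -v hmmsearch >/dev/null 2>&1 && [ -f My_proteins.fa ]; then
  hmmsearch --tblout My_proteins.Pfam.tbl -o My_proteins.Pfam_search_out.txt \
    /nfs/seq/pfam_db/Pfam-A.hmm My_proteins.fa
fi

# List protein, profile and full-sequence E-value for hits at or below a cutoff.
# In --tblout output, comment lines start with '#', column 1 is the target
# (protein) name, column 3 the query (profile) name, column 5 the E-value.
# Arguments: tblout file, E-value cutoff.
tblout_hits() {
  awk -v cutoff="$2" \
    '!/^#/ && ($5 + 0) <= (cutoff + 0) {print $1 "\t" $3 "\t" $5}' "$1"
}
```
- Usage, once a search has finished: tblout_hits My_proteins.Pfam.tbl 1e-5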

How do I install an R package locally?

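# Method 1: install from a downloaded source package (*.tar.gz)
# Download the package you are interested in installing, then run in R:
# Location of the source file AND where to install the package
R_libraries_path = "/home/userName/R_libs"
# Go to where the .tar.gz source is and install it
setwd(R_libraries_path)
install.packages("hthgu133ahsentrezgcdf_12.0.0.tar.gz", lib=R_libraries_path, repos=NULL)
# Load it later with: library(hthgu133ahsentrezgcdf, lib.loc=R_libraries_path)

# Method 2: install from Bioconductor into a personal library
# In the bash shell, use a directory called R, or whatever you'd like
export R_LIBS="$HOME/R"
# In R (biocLite() has been replaced by BiocManager in current Bioconductor releases)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("pd.mogene.1.0.st.v1")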
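
- For Method 2, the directory named in R_LIBS must exist before R starts, because R ignores library paths that do not exist. A minimal sketch of the shell side, assuming the directory name $HOME/R used above:
```shell
# Create the personal library directory first: R ignores
# R_LIBS entries that do not exist.
mkdir -p "$HOME/R"
export R_LIBS="$HOME/R"

# If R is installed, confirm the directory is now on the library search path.
if command -v Rscript >/dev/null 2>&1; then
  Rscript -e 'print(.libPaths())'
fi
```
- To make the setting permanent, the export line can be added to your shell startup file (for bash, typically ~/.bashrc).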

I need to send/receive very large data files to/from a colleague outside of Whitehead. What is the best way to do this?
- Our IT department has built two tools to help you share large files with your colleagues: Sendit for files up to 2GB and Vort for files over 2GB (http://wi-inside.wi.mit.edu/departments/it/services/filetransfer).
Why do I get different BLAST results from http://fry.wi.mit.edu/blast/ and NCBI Blast?
- WI and NCBI pages have very different defaults. As a result, a query can produce a hit at WI with an e-value of 1e-12, while at NCBI the alignment is completely different and leads to an e-value of 2e-98.
- Word sizes are different, as are match/mismatch scores (1/-3 at WI, 2/-3 at NCBI).
- Blast at WI filters (hard masks) for low complexity by default, whereas at NCBI it doesn't, instead using the "L;m;" filter string, which filters for low-complexity for the lookup table but not for extension. This can have a huge effect.
- Even at NCBI, choosing (a) human RefSeq sequences or (b) all RefSeq sequences and then filtering for human only produces somewhat different results. Also, the size of the database has a big effect on the e-values. For one example query that produces the same alignment in three different databases, the e-values are very different:
  - human RefSeq sequences => 0.008
  - all RefSeq sequences => 0.47
  - nt => 2.9
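- The database-size effect follows from how BLAST computes e-values: for a fixed alignment score, the e-value grows roughly in proportion to the size of the search space. Under that assumption, the ratio of two e-values for the same alignment gives a rough estimate of the relative (effective) database sizes. A minimal sketch using the three values above (evalue_ratios is a hypothetical helper name):
```shell
# Read "label e-value" pairs on standard input and print, for each line,
# the label, the e-value, and its ratio to the first e-value.
evalue_ratios() {
  awk 'NR == 1 {base = $2} {printf "%s\t%.3g\t%.1f\n", $1, $2, $2 / base}'
}

# E-values reported above for the same alignment in three databases.
printf '%s\n' 'human_RefSeq 0.008' 'all_RefSeq 0.47' 'nt 2.9' | evalue_ratios
```
- When comparing hits across databases, it is therefore safer to compare alignments and bit scores than raw e-values.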
How do I run tophat/bowtie on the LSF with a gzip'd tar (*.tar.gz) file?
- Use process substitution, which lets tophat read the extracted reads directly from tar: bsub bash -c "tophat ... <(tar xvzfO ...) <(tar xvzfO ...)"
  - eg. bsub bash -c "tophat -p 10 -g 1 -o mapped_data_SRR905147_unique -N 2 -I 10000 --segment-length 25 --segment-mismatches 2 hg19 <(tar xvzfO s_2_1_sequence.txt.tar.gz ACTTGA-s_2_1_sequence.txt) <(tar xvzfO CAGATC-s_2_1_sequence.txt.tar.gz CAGATC-s_2_1_sequence.txt)"
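- To see how this works without a cluster or tophat, here is a minimal sketch with a tiny made-up archive (the demo_sequence file names are hypothetical, and wc stands in for tophat). Process substitution is a bash feature, which is why the commands above are wrapped in bash -c:
```shell
# Build a small stand-in archive holding one made-up FASTQ record.
printf '@read1\nACGT\n+\nIIII\n' > demo_sequence.txt
tar -czf demo_sequence.txt.tar.gz demo_sequence.txt

# Inside bash, <(...) looks like a file name to the program reading it.
# tar -O writes the named archive member to standard output, so the
# reads are never uncompressed to disk.
bash -c 'wc -l <(tar -xzOf demo_sequence.txt.tar.gz demo_sequence.txt)'
```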
How can I run AlphaFold 2.3 here at Whitehead?
- You'll need access to a system with a GPU, or the new GPU queue. IT can help you choose and obtain access to GPU systems if required.
- For a working command to be executed on fry.wi.mit.edu, which then sends the AlphaFold command to the GPU node on the slurm cluster, see /nfs/BaRC_Public/BaRC_code/shell/run_AlphaFold/Commands.sh
- Start by copying RunAlphaFold_slurm.sh to your project directory.
- The main commands look like
```
sbatch -J AF220_1 --export=ALL,FASTA_NAME=Sample_protein_1.fa,USERNAME=myUsername,FASTA_PATH=proteins,AF2_WORK_DIR=/nfs/BaRC_Public/BaRC_code/shell/run_AlphaFold ./RunAlphaFold_2.2.0_slurm.sh

sbatch -J AFmult --export=ALL,FASTA_NAME=Leucine_zipper_minimal.fa,USERNAME=myUsername,FASTA_PATH=proteins,AF2_WORK_DIR=/nfs/BaRC_Public/BaRC_code/shell/run_AlphaFold ./RunAlphaFold_multimer_2.3.2_slurm.sh
```
  where the inputs are
  - AF2_WORK_DIR => project working directory
  - FASTA_PATH => directory within AF2_WORK_DIR with input protein sequence (or .)
  - FASTA_NAME => name of input protein sequence file within $AF2_WORK_DIR/$FASTA_PATH
  - USERNAME => Username for job submission and email
- More information, including an explanation of the output files, is here: https://github.com/deepmind/alphafold
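- If you have several proteins to model, it can help to generate the sbatch commands rather than type them. A minimal sketch, assuming one .fa file per prediction, a helper named alphafold_commands, and an "AF_" job-name prefix (all hypothetical); it only prints the commands, following the pattern shown above, so you can review them before running any:
```shell
# Print (not submit) one sbatch command per FASTA file in a directory,
# so the commands can be checked before running them.
# Arguments: working directory, FASTA sub-directory, username.
alphafold_commands() {
  work_dir=$1; fasta_path=$2; user=$3
  for fasta in "$work_dir/$fasta_path"/*.fa; do
    [ -e "$fasta" ] || continue          # no .fa files: nothing to print
    name=$(basename "$fasta")
    echo "sbatch -J AF_${name%.fa} --export=ALL,FASTA_NAME=$name,USERNAME=$user,FASTA_PATH=$fasta_path,AF2_WORK_DIR=$work_dir ./RunAlphaFold_multimer_2.3.2_slurm.sh"
  done
}
```
- Usage: alphafold_commands /path/to/project proteins myUsername (swap in the script name that matches the AlphaFold version you want to run).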

How can I run AlphaFold 3 here at Whitehead?
- You can't. As of May 2024, the code for AlphaFold 3 has not been released. You can, however, run it on Google's AlphaFold 3 web server: https://golgi.sandbox.google.com/

How can I output small p-values from R?
- R often prints small p-values as "p-value < 2.2e-16" whereas the output from the statistical test has actually calculated a much more accurate p-value.
- To access the exact p-value, explicitly call the value from the test output. For example
```
a = jitter(rep(1, 20))
b = jitter(rep(3, 20))
t.test(a,b)  # p-value < 2.2e-16
t.test(a,b)$p.value  # 9.0e-40 (or another small value, dependent on the output from jitter())
```
- Other statistical tests may save the p-value with a different name. Try names(OBJECT) to see your choices and figure out how to access the exact value.
How can I load the earlier version of Seurat in R?
- As of April 18, 2024, the default version of Seurat available on fry/RStudio is v5.0.1.
- To use Seurat version 5, enter the following command in R:
```
library(Seurat)
```
- In some instances, certain software may only be compatible with older versions of Seurat, or you might need to run analyses on Seurat objects created with earlier versions. In these cases, you can load an earlier version of Seurat by specifying the library location.
- For example, to load Seurat version 4, use this command:
```
library(irlba, lib.loc = "/nfs/apps/lib/R/4.2-focal/site-library.2023q4") # the 'irlba' package is required by Seurat for linear algebra
library(Seurat, lib.loc = "/nfs/apps/lib/R/4.2-focal/site-library.2023q1")
```
- This command directs R to use a specific library path where Seurat version 4 is installed, ensuring compatibility with older datasets or software requirements.
- Alternatively, you can explicitly load all R packages from the previous R library set to keep the package versions consistent and compatible.
```
set_lib_paths <- function(lib_vec) {
  lib_vec <- normalizePath(lib_vec, mustWork = TRUE)
  shim_fun <- .libPaths
  shim_env <- new.env(parent = environment(shim_fun))
  shim_env$.Library <- character()
  shim_env$.Library.site <- character()
  environment(shim_fun) <- shim_env
  shim_fun(lib_vec)
}

set_lib_paths(c("/nfs/apps/lib/R/4.2-focal/site-library.2023q1", "/opt/R/4.2.1/lib/R/library"))

library(Seurat)
# load other required packages, e.g. scPred, for cell type annotation.
library(scPred)
```
- With the code above, all the required R packages are loaded from the previous R library set (2023q1).
