To learn about Whitehead Institute bioinformatics resources, browse this page, use your browser's Find command, or use the Search box at the top right of the page to search the questions and answers.

Frequently Asked Questions

 1. Where can I find sample '''[#blastplus blast+ commands?]'''[[br]] [[br]]
 1. How can I '''[#align2 align two sequences?]'''[[br]] [[br]]
 1. How can I '''[#blat run BLAT locally?]'''[[br]] [[br]]
 1. How can I '''[#promoter get the promoter sequence]''' of a gene?[[br]] [[br]]
 1. How can I '''[#nonred make a list of items non-redundant]'''?[[br]] [[br]]
 1. How can I access the '''[#entrez Whitehead version of the Entrez Gene database]'''?[[br]] [[br]]
 1. How can I find slides and materials from '''[#hottopics past Hot Topics talks]'''?[[br]] [[br]]
 1. How can I '''[#relational create my own relational database]'''?[[br]] [[br]]
 1. How can I '''[#tracks download data/tracks from UCSC]'''?[[br]] [[br]]
 1. How can I '''[#barctools access BaRC Tools]''' or know what tools are available?[[br]] [[br]]
 1. How can I '''[#slurm submit a job to the slurm cluster]'''?[[br]] [[br]]
 1. How can I find out what '''[#perlR Perl modules or R packages]''' are installed? Which version is currently installed on the server?[[br]] [[br]]
 1. How can I '''[#xwindow connect to fry]'''?[[br]] [[br]]
 1. How can I '''[#servers get to my or my lab's shared storage]'''?[[br]] [[br]]
 1. Where can I '''[#blast find local BLAST databases]'''?[[br]] [[br]]
 1. Where can I '''[#genomeSeqs find genome sequences]'''? [[br]] [[br]]
 1. Where can I '''[#btFormats find genomes formatted for bowtie, tophat, or blat]'''?[[br]] [[br]]
 1. Where can I '''[#tfs find known or predicted transcription factors that regulate a gene]'''?[[br]] [[br]]
 1. Where can I '''[#unix find simple (one-liner) Unix/Perl commands]'''?[[br]] [[br]]
 1. Where can I '''[#Rcode find samples of R code]'''?[[br]] [[br]]
 1. Where can I '''[#UCSCmirror find the local mirror of the UCSC genome browser]'''?[[br]] [[br]]
 1. Where can I '''[#galaxy find the local mirror of Galaxy]'''?[[br]] [[br]]
 1. Where can I '''[#rstudio find a local R Studio server]'''?[[br]] [[br]]
 1. Where can I '''[#IGV find the IGV download]'''?[[br]] [[br]]
 1. Which software should I use to '''[#heatmaps cluster, create and display heatmaps]'''? [[br]] [[br]]
 1. Which software should I use to do '''[#GOtools GO enrichment analysis]'''?[[br]] [[br]]
 1. Which software should I use to '''[#GeneNetwork display a gene network]'''?[[br]] [[br]]
 1. How can I get '''[#software desktop software]''' provided by Whitehead?[[br]] [[br]]
 1. Which software should I use to '''[#stats do statistics]'''?[[br]] [[br]]
 1. How can I '''[#pfam search for Pfam (protein) profiles]''' in my protein set using HMMs?[[br]] [[br]]
 1. How do I '''[#R_pkg_install install an R package locally]'''?[[br]] [[br]]
 1. I need to '''[#transfer send/receive very large data files]''' to/from a colleague outside of Whitehead. What is the best way to do this?[[br]] [[br]]
 1. Why do I get '''[#wi_ncbi_blast different BLAST results]''' from [[http://fry.wi.mit.edu/blast/ | WI]] and NCBI BLAST? [[br]] [[br]]
 1. How do I run '''[#tophat_bowtie tophat/bowtie on the LSF with a gzip'd tar (*.tar.gz)]''' file? [[br]] [[br]]
 1. How can I run '''[#alphafold AlphaFold 2.0-2.3]''' here at Whitehead? [[br]] [[br]]
 1. How can I run '''[#alphafold3 AlphaFold 3]''' here at Whitehead? [[br]] [[br]]
 1. How can I output '''[#pvalues small p-values]''' from R? [[br]] [[br]]
 1. How can I load an '''[#Seurat earlier version of Seurat]''' in R? [[br]] [[br]]

----
Answers to Frequently Asked Questions

 1. [=#blastplus Where can I find sample blast+ commands?] [[br]] [[br]]
   * See [http://barcwiki.wi.mit.edu/wiki/blastTips BLAST+ tips][[br]] [[br]]
 1. [=#align2 How can I '''align two sequences'''?]
    [[br]] [[br]]
   * Use an EMBOSS program ([http://bioinfo.wi.mit.edu/bio/tools/emboss/]) for an optimal alignment
     * **water** for a Smith-Waterman optimal local alignment
     * **needle** for a Needleman-Wunsch optimal global alignment
     * **stretcher** for a Needleman-Wunsch optimal global alignment (optimized for longer sequences)
   * Use **blast2seq** [https://blast.ncbi.nlm.nih.gov/Blast.cgi?BLAST_SPEC=blast2seq&LINK_LOC=align2seq&PAGE_TYPE=BlastSearch] for a quick local alignment [[br]] [[br]]
 1. [=#blat How can I '''run BLAT locally'''?] [[br]] [[br]]
   * See our [http://bioinfo.wi.mit.edu/bio/bioinfo/docs/blat_tak.html Using BLAT on fry] page.[[br]] [[br]]
 1. [=#promoter How can I '''get the promoter sequence''' of a gene?] [[br]] [[br]]
   - Go to the [http://genome.ucsc.edu/cgi-bin/hgGateway UCSC Genome Bioinformatics] genome browser.
   - Choose your desired genome and enter your desired gene (in the "position or search term" box).
   - If the gene has multiple transcripts, choose the one you want.
   - Paying attention to the direction of the gene (indicated by the intron hash marks), note the coordinate of the transcription start site (TSS).
   - Enter a range of coordinates before and/or after the TSS and click on "jump".
   - When you have the desired range in the browser, click on "DNA" on the top blue bar.
   - Check the "Reverse complement" box if your gene is on the negative strand.
   - Click on the "get DNA" button.
   - If you want to check your sequence relative to the TSS, map it with [http://genome.ucsc.edu/cgi-bin/hgBlat?command=star BLAT].[[br]] [[br]]
 1. [=#nonred How can I '''make a list of items non-redundant'''?] [[br]] [[br]]
   * See our [http://barc.wi.mit.edu/tools/redundant/ Redundant List Analysis] page, which also counts how many times each item appears in your list.[[br]] [[br]]
 1. [=#entrez How can I '''access the Whitehead version of the Entrez Gene database'''?]
    [[br]] [[br]]
   * Whitehead BaRC built a local copy of the Entrez Gene database using MySQL.
   * You need a MySQL client to access the database, either a desktop tool like [http://wb.mysql.com/ MySQL Workbench] or a fry account.
   * The information you need:
     * Hostname = devo.wi.mit.edu
     * database = entrez_gene
     * username = barc_read_only
     * password = Ask BaRC about this
   * On fry, use the command
     * mysql -u barc_read_only -h devo.wi.mit.edu -D entrez_gene -p[[br]] [[br]]
 1. [=#hottopics How can I find slides and materials from '''past Hot Topics talks'''?] [[br]] [[br]]
   * See our [http://barc.wi.mit.edu/education/hot_topics/ Hot Topics] page, with links to presentations and other materials.[[br]] [[br]]
 1. [=#relational How can I '''create my own relational database'''?] [[br]] [[br]]
   * You have the choice of creating a MySQL installation on your own computer or using Whitehead's MySQL server (canna, which would generally be more robust, if that's needed).
   * If you'd like your own installation, download [http://dev.mysql.com/downloads/mysql/ MySQL] and install it.
   * If you'd like to use devo, email callcenter@wi.mit.edu and request a database on devo. Once the IT group creates the database, you will be free to add tables and data.
   * Regardless of the system, see the [http://dev.mysql.com/doc/refman/5.5/en/index.html MySQL Reference Manual] and our past BaRC presentations about MySQL:
     * [http://barc.wi.mit.edu/education/bioinfo2006/db4bio/ Relational Databases For Biologists]
     * [http://barc.wi.mit.edu/education/bioinfo2006/db4bio/ Querying Biological Databases with SQL][[br]] [[br]]
 1. [=#tracks How can I '''download data/tracks from UCSC'''?] [[br]] [[br]]
   * Go to [http://genome.ucsc.edu/ UCSC Genome Bioinformatics].
   * Click on "Downloads" on the bar on the left side.
   * Choose the desired species and assembly, noting that coordinates only apply to the assembly they were generated with.
   * Data from most tracks are available by following the "Annotation database" link.
   * Every file is either the actual data in a tab-delimited text file (*.txt.gz) or a small file that provides a name for each column.
   * The data files can be opened in Excel, processed as text files, or used to create a table in your MySQL database.
   * Note from the dates of the annotation files that some are updated much more often than others.[[br]] [[br]]
 1. [=#barctools How can I '''access BaRC Tools''' or know what tools are available?] [[br]] [[br]]
   * BaRC Tools can be found on [http://bioinfo.wi.mit.edu/bio/tools/ BaRC Tools]. Available tools are summarized in [http://bioinfo.wi.mit.edu/bio/education/hot_topics/barc_tools/barcTools-summary.pdf Summary]. [[br]] [[br]]
 1. [=#slurm How can I '''submit a job to the slurm cluster'''?] [[br]] [[br]]
   * Usually you just need to wrap your usual command with syntax like
   {{{
   sbatch --partition=20 --job-name=MY_JOB --mem=32G --wrap "MY COMMAND"
   }}}
   where
     * 'sbatch' is the main command
     * 'partition' indicates that you want to use Ubuntu 20 servers (with the most up-to-date software)
     * 'job-name' is any short one-word name so you can more easily monitor its progress
     * 'mem' is a required parameter requesting a specific amount of memory, and
     * 'wrap' means that 'sbatch will wrap the specified command string in a simple "sh" shell script, and submit that script to the slurm controller'.
   * See [https://clusterguide.wi.mit.edu/using-the-slurm-cluster/ IT's Using the slurm cluster] for sample commands, changes in the environment (compared to the older LSF cluster) and links to more documentation. [[br]] [[br]]
 1. [=#perlR How can I find out what '''Perl modules or R packages''' are installed?] Which version is currently installed on the server?
    [[br]] [[br]]
   * The [https://trac.wi.mit.edu/wiki home page] of Trac, our Tak tracking system, links to [http://tak/trac/wiki/Packages installed packaged software], [http://tak/trac/wiki/Perl installed Perl modules], [http://tak/trac/wiki/Python installed Python modules], and [http://tak/trac/wiki/R installed R packages]. [[br]] [[br]]
 1. [=#xwindow How can I '''connect to fry'''?][[br]] [[br]]
   * To connect to fry, you need a [http://bioinfo.wi.mit.edu/bio/software/unix/bioinfoaccount.php fry account] and some kind of secure shell (ssh) with X Windows (to get the graphics):
     * On a Macintosh, use X11 or Terminal.
     * On a Windows computer, we recommend [http://cygwin.com Cygwin/X]. You can also use the Whitehead IT [http://bioinfo.wi.mit.edu/bio/tutorials/takpack/TakPack-Installer.exe TakPack installer].
   * With either system, double click on the icon to get the "command prompt", the window in which you can type commands.
   * Windows only: After opening Cygwin, start X Windows by typing "startx". A new terminal window will open, and you should use that one.
   * From the command prompt, connect to fry (or another Unix/Linux computer) with a command like
   {{{
   ssh username@fry.wi.mit.edu -Y
   or
   ssh username@fry.wi.mit.edu -X
   where username is the name of your fry account.  You'll be prompted for your password.
   }}}
   * In addition, see the tutorials created by IT called [http://wi-inside.wi.mit.edu/departments/it/services/scientificcomputing/scitutorials Install Cygwin or TakPack]. IT recommends using TakPack if possible, as it provides both X11 and an ssh client and is a "lighter install".[[br]] [[br]]
 1. [=#servers How can I '''get to my or my lab's shared storage'''?] [[br]] [[br]]
   * On a Mac or Windows computer, most shared storage areas can be accessed via '''wi-files1''' or '''wi-files2''', although high-throughput sequencing data is accessed via '''wi-htdata'''.
   * On a Mac computer, get to a server like wi-files1/BaRC_Public (/nfs/BaRC_Public) by connecting to
   {{{
   cifs://wi-files1/BaRC_Public
   }}}
   * On a Windows computer, get to a server like wi-files1/BaRC_Public (/nfs/BaRC_Public) by connecting to
   {{{
   \\wi-files1\BaRC_Public
   }}}
   * See [http://it.wi.mit.edu/systems/file-storage/lab-share-paths lab share paths] to get to your lab storage area. [[br]] [[br]]
 1. [=#blast Where can I '''find local BLAST databases'''?] [[br]] [[br]]
   * BLAST-formatted databases can be found in /nfs/seq/Data on fry.[[BR]][[br]]
 1. [=#genomeSeqs Where can I '''find genome sequences'''? ] [[br]] [[br]]
   * Genome sequences can be found in /nfs/genomes on fry.[[BR]][[br]]
 1. [=#btFormats Where can I '''find genomes formatted for bowtie, tophat, or blat'''?] [[br]] [[br]]
   * Within many directories on /nfs/genomes you can find these additional files.[[BR]][[br]]
 1. [=#tfs Where can I '''find known or predicted transcription factors that regulate a gene'''?] [[br]] [[br]]
   * We do not have access to the [https://portal.biobase-international.com/cgi-bin/portal/login.cgi BIOBASE Knowledge Library]; however, an (older) command-line version is available. See BaRC SOPs for more info.[[br]]
   * [http://portal.genego.com/ GeneGO (Login Required)] can be used as well to find known TFs. [[br]] [[br]]
 1. [=#unix Where can I '''find simple (one-liner) Unix/Perl commands'''?] [[br]] [[br]]
   * There is a helpful list of Unix and Perl commands at [http://bioinfo.wi.mit.edu/bio/bioinfo/scripts/].[[br]] [[br]]
 1. [=#Rcode Where can I '''find samples of R code'''?] [[br]] [[br]]
   * Sample R code is available in /nfs/BaRC_Public/BaRC_code.[[br]][[br]]
 1. [=#UCSCmirror Where can I '''find the local mirror of the UCSC genome browser'''?][[br]] [[br]]
   * [http://ucsc.wi.mit.edu/]
   * To add tracks via files, copy the files to /nfs/solexa_ucsc and then access them via the URL http://weblinks.wi.mit.edu/solexa_ucsc/ to submit tracks to the browser.[[br]] [[br]]
 1. [=#galaxy Where can I '''find the local mirror of Galaxy'''?][[br]] [[br]]
   * It used to be at [https://galaxy.wi.mit.edu/] but we no longer provide Galaxy. [[br]] [[br]]
 1. [=#rstudio Where can I '''find a local R Studio server'''?][[br]] [[br]]
   * [https://rstudio.wi.mit.edu] [[br]] [[br]]
 1. [=#IGV Where can I '''find the IGV download'''?] [[br]] [[br]]
   * [http://www.broadinstitute.org/software/igv/log-in][[br]] [[br]]
 1. [=#heatmaps Which software should I use to '''cluster, create and display heatmaps'''?] [[br]] [[br]]
   * Cluster 3.0 http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
   * Java Treeview http://jtreeview.sourceforge.net/ [[br]] [[br]]
 1. [=#GOtools Which software should I use to do '''GO enrichment analysis'''?] [[br]] [[br]]
   * DAVID http://david.abcc.ncifcrf.gov/ - our favorite tool for entering a list of genes of interest
   * GSEA http://www.broadinstitute.org/gsea/index.jsp - a Java application that can take a ranked list of all your genes as input; our favorite tool for including all genes
   * BIOBASE https://portal.biobase-international.com/cgi-bin/portal/login.cgi
   * BiNGO (within Cytoscape http://www.cytoscape.org/)
   * GoMiner http://discover.nci.nih.gov/gominer
   * GOstat http://gostat.wehi.edu.au
   * [http://www.geneontology.org/GO.tools.shtml#term_enrichment GeneOntology.org] has a more complete list.
   * Also see the [http://barc.wi.mit.edu/education/hot_topics/ Hot Topics talk] "Gene list enrichment analysis" for more information. [[br]] [[br]]
 1. [=#GeneNetwork Which software should I use to '''display a gene network'''?] [[br]] [[br]]
   * Cytoscape http://www.cytoscape.org/ is our favorite tool for this. [[br]] [[br]]
 1. [=#software How can I get '''desktop software''' provided by Whitehead?][[br]] [[br]]
   * Software is available through the [https://software.wi.mit.edu/ Whitehead Software database]. To see a list of desktop software, see [http://it.wi.mit.edu/software/get-software].[[br]][[br]]
 1. [=#stats Which software should I use to '''do statistics'''?] [[br]] [[br]]
   * GraphPad Prism (in the [https://software.wi.mit.edu/ Whitehead software database]) -- has an easy-to-use GUI and excellent practical documentation
   * MATLAB (in the [https://software.wi.mit.edu/ Whitehead software database])
   * R (free download from http://www.r-project.org/)
   * Also see the BaRC Hot Topics on [http://barc.wi.mit.edu/education/hot_topics/prism/Prism.pdf An Introduction to GraphPad Prism - statistics and graphing software] [[br]][[br]]
 1. [=#pfam How can I '''search for Pfam (protein) profiles'''] in my protein set using HMMs?
   * A local copy of all Pfam HMMs can be found at /nfs/seq/pfam_db/Pfam-A.hmm
   * Pfam profiles can be easily searched with the HMMER suite of tools.
   * If you want to annotate a set of proteins with only specific profiles, you can extract one profile at a time with hmmfetch
     * ex: hmmfetch /nfs/seq/pfam_db/Pfam-A.hmm PF01731.15 > PF01731.15.hmm
   * Use hmmsearch to search your protein set (as a multiple-sequence fasta file) with one to all profiles
     * ex1 (all profiles): hmmsearch /nfs/seq/pfam_db/Pfam-A.hmm My_proteins.fa > My_proteins.Pfam_search_out.txt
     * ex2 (selected profile): hmmsearch PF01731.15.hmm My_proteins.fa > My_proteins.PF01731.15_search_out.txt
   * For more details about HMMER, consult the HMMER User's Guide ([ftp://selab.janelia.org/pub/software/hmmer3/3.0/Userguide.pdf]).[[br]] [[br]]
 1. [=#R_pkg_install How do I '''install an R package''' locally?]
   {{{
   # Method 1: install from a downloaded source package (*.tar.gz)
   # Download the package you are interested in installing, then, in the R command line:
   # Location of the source AND where to install the package
   R_libraries_path = "/home/userName/R_libs"
   # Go to where the .tar.gz library source is and install
   setwd(R_libraries_path)
   install.packages("hthgu133ahsentrezgcdf_12.0.0.tar.gz", lib=R_libraries_path, repos=NULL)

   # Method 2: install from Bioconductor into a personal library
   # In bash shell, use a directory called R, or whatever you'd like
   export R_LIBS="$HOME/R"
   # In R command line (BiocManager replaces the older biocLite() script, which no longer works with R >= 3.5)
   if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
   BiocManager::install("pd.mogene.1.0.st.v1")
   }}}
 1. [=#transfer I need to '''send/receive very large data files''' to/from a colleague outside of Whitehead. What is the best way to do this?]
   * Our IT department has built two tools to help you share large files with your colleagues. [http://wi-inside.wi.mit.edu/departments/it/services/filetransfer Sendit for files up to 2GB; Vort for files over 2GB] [[br]] [[br]]
 1. [=#wi_ncbi_blast Why do I get '''different BLAST results''' from http://fry.wi.mit.edu/blast/ and NCBI BLAST?]
   * The WI and NCBI pages have very different defaults. As a result, a hit can have an e-value of 1e-12 at WI, while at NCBI the alignment is completely different and leads to an e-value of 2e-98.
   * Word sizes are different, as are match/mismatch scores (1/-3 at WI, 2/-3 at NCBI).
   * BLAST at WI filters (hard masks) for low complexity by default, whereas NCBI does not, instead using the "L;m;" filter string, which filters low-complexity regions for the lookup table but not for extension. This can have a huge effect.
   * Even at NCBI, choosing (a) human RefSeq sequences or (b) all RefSeq sequences and then filtering for human only produces somewhat different results. Also, the size of the database has a big effect on the e-values. For one example query that produces the same alignment in three different databases, the e-values are very different:
     * human RefSeq sequences => 0.008
     * all RefSeq sequences => 0.47
     * nt => 2.9
 1. [=#tophat_bowtie How do I run '''tophat/bowtie on the LSF with a gzip'd tar (*.tar.gz)''' file?]
   * Use process substitution: bsub bash -c "tophat ... <(tar xvzfO ...) <(tar xvzfO ...)"
   * e.g. bsub bash -c "tophat -p 10 -g 1 -o mapped_data_SRR905147_unique -N 2 -I 10000 --segment-length 25 --segment-mismatches 2 hg19 <(tar xvzfO s_2_1_sequence.txt.tar.gz ACTTGA-s_2_1_sequence.txt) <(tar xvzfO CAGATC-s_2_1_sequence.txt.tar.gz CAGATC-s_2_1_sequence.txt)"
 1. [=#alphafold How can I run '''AlphaFold 2.3''' here at Whitehead?]
   * You'll need access to a system with a GPU, or the new GPU queue. IT can help you choose and obtain access to GPU systems if required.
   * For a working command to be executed on fry.wi.mit.edu, which then sends the AlphaFold command to the GPU node on the slurm cluster, see /nfs/BaRC_Public/BaRC_code/shell/run_AlphaFold/Commands.sh
   * Start by copying RunAlphaFold_slurm.sh to your project directory.
   * The main commands look like
   {{{
   sbatch -J AF220_1 --export=ALL,FASTA_NAME=Sample_protein_1.fa,USERNAME=myUsername,FASTA_PATH=proteins,AF2_WORK_DIR=/nfs/BaRC_Public/BaRC_code/shell/run_AlphaFold ./RunAlphaFold_2.2.0_slurm.sh
   sbatch -J AFmult --export=ALL,FASTA_NAME=Leucine_zipper_minimal.fa,USERNAME=myUsername,FASTA_PATH=proteins,AF2_WORK_DIR=/nfs/BaRC_Public/BaRC_code/shell/run_AlphaFold ./RunAlphaFold_multimer_2.3.2_slurm.sh
   }}}
   where the inputs are
     * AF2_WORK_DIR => project working directory
     * FASTA_PATH => directory within AF2_WORK_DIR with input protein sequence (or .)
     * FASTA_NAME => name of input protein sequence file within $AF2_WORK_DIR/$FASTA_PATH
     * USERNAME => Username for job submission and email
   * More information, including an explanation of the output files, is here: https://github.com/deepmind/alphafold
 1. [=#alphafold3 How can I run '''AlphaFold 3''' here at Whitehead?]
   * You can't. As of May 2024, the code for AlphaFold 3 has not been released. You can, however, run it on Google's AlphaFold 3 web server: https://golgi.sandbox.google.com/
 1. [=#pvalues How can I output '''small p-values''' from R?]
   * R often prints small p-values as "p-value < 2.2e-16" even though the statistical test has actually calculated a much more precise p-value.
   * To access the exact p-value, explicitly extract the value from the test output. For example
   {{{
   a = jitter(rep(1, 20))
   b = jitter(rep(3, 20))
   t.test(a,b)            # p-value < 2.2e-16
   t.test(a,b)$p.value    # 9.0e-40 (or another small value, dependent on the output from jitter())
   }}}
   * Other statistical tests may save the p-value with a different name. Try names(OBJECT) to see your choices and figure out how to access the exact value.
 1. [=#Seurat How can I load an '''earlier version of Seurat''' in R?]
   * As of April 18, 2024, the default version of Seurat available on fry/RStudio is v5.0.1.
   * To use Seurat version 5, enter the following command in R:
   {{{
   library(Seurat)
   }}}
   * In some instances, certain software may only be compatible with older versions of Seurat, or you might need to run analyses on Seurat objects created with earlier versions. In these cases, you can load an earlier version of Seurat by specifying the library location.
   * For example, to load Seurat version 4, use these commands:
   {{{
   library(irlba, lib.loc = "/nfs/apps/lib/R/4.2-focal/site-library.2023q4") # the 'irlba' package is required by Seurat for linear algebra
   library(Seurat, lib.loc = "/nfs/apps/lib/R/4.2-focal/site-library.2023q1")
   }}}
   * These commands direct R to use a specific library path where Seurat version 4 is installed, ensuring compatibility with older datasets or software requirements.
   * Alternatively, you can explicitly load all R packages from a single earlier R library set to keep the package versions consistent and compatible.
   {{{
   # Replace R's library search path with only the given directories
   set_lib_paths <- function(lib_vec) {
     lib_vec <- normalizePath(lib_vec, mustWork = TRUE)
     # Copy .libPaths into an environment where the default libraries are empty
     shim_fun <- .libPaths
     shim_env <- new.env(parent = environment(shim_fun))
     shim_env$.Library <- character()
     shim_env$.Library.site <- character()
     environment(shim_fun) <- shim_env
     shim_fun(lib_vec)
   }
   set_lib_paths(c("/nfs/apps/lib/R/4.2-focal/site-library.2023q1", "/opt/R/4.2.1/lib/R/library"))
   library(Seurat)
   # load other required packages, e.g. scPred, for cell type annotation.
   library(scPred)
   }}}
   * By running the code above, all the required R packages will be loaded from the previous R library set (2023q1).
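   * Because set_lib_paths() calls normalizePath() with mustWork = TRUE, it stops with an error if any of the library directories is missing. A minimal sketch for confirming the directories from a bash shell before starting R is shown here; it assumes bash is available, and check_r_libs is a hypothetical helper name used only for illustration, not an installed tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an installed tool): report whether each
# R library directory given as an argument exists.
# Returns 0 if all directories exist, 1 if any is missing.
check_r_libs() {
    local dir
    local missing=0
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf 'found:   %s\n' "$dir"
        else
            printf 'missing: %s\n' "$dir"
            missing=1
        fi
    done
    return "$missing"
}

# Check the two library paths used in the set_lib_paths() call above.
check_r_libs "/nfs/apps/lib/R/4.2-focal/site-library.2023q1" \
             "/opt/R/4.2.1/lib/R/library" \
    || echo "At least one library path is missing; set_lib_paths() would fail."
```

   If every path is reported as found, the set_lib_paths() call should get past its normalizePath() check; whether Seurat itself then loads still depends on the contents of those libraries.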