#################
### QUICKSTART ####
#################

beRBP is an RBP target prediction algorithm that uses a Random Forest classifier to analyze RNA sequence/structure features (motif matching, clustering, accessibility, and conservation). beRBP includes 37 Specific models (for RBPs that have PWMs and sufficient positive training examples) and 1 General model. The General model has two working modes: selected PWMs (for RBPs with built-in PWMs) and user-supplied PWM (for a new RBP whose PWM is not among those built into beRBP).

At http://bioinfo.vanderbilt.edu/beRBP/, you can run beRBP through our web server and find references to the methodology. Alternatively, you may decompress the present archive (beRBP.tgz) on your own Linux computer and run beRBP locally. To run locally, you need the software archive plus a separate, very large library file (hg38.phyloP100way.bw, ~10G), and you must pre-install several dependencies (R, R/randomForest, the ViennaRNA Package, bigWigToWig, and NCBI-blast).

The uncompressed archive contains the following subdirectories: code, data, lib, work. "code" holds all scripts; the top-level scripts are specific.sh, general_sPWM.sh, and general.sh, corresponding to the three working modes mentioned above. In "work/temp/", several *.fasta and *.pwm.txt files are provided as sample input. Put your own input files (*.fasta and *.pwm.txt) here to run real jobs.

Once all prerequisites are satisfied, run the commands below to test beRBP with the sample input 010.fasta and 010.pwm.txt. On our Linux server, the first two commands completed within minutes, and the third finished within 20 minutes.

###### Always make /WHERE_beRBP_IS/work/ your working directory.
###### 010 is the project name, and also the root name of the mandatory FASTA file (work/temp/010.fasta) and of the user-provided PWM file (work/temp/010.pwm.txt). The PWM file is not required by specific.sh or general_sPWM.sh.
###### general.sh takes only one argument, the project name. In this case, both work/temp/010.fasta and work/temp/010.pwm.txt are required.
###### "CIRBP", the 2nd argument to specific.sh, names a Specific model. See data/Specific37.lst.
###### M037_0.6 and MBNL1, the 2nd and 3rd arguments to general_sPWM.sh, name a PWM and an RBP. Refer to data/pwm.lst, data/rbp.lst, and data/rbp2pwm2pwmLen.

cd /WHERE_beRBP_IS/work/
../code/specific.sh 010 CIRBP >temp/010.log &                 # Specific model
../code/general_sPWM.sh 010 M037_0.6 MBNL1 >temp/010.log &    # General model, selecting MBNL1's M037_0.6 PWM
../code/general.sh 010 >temp/010.log &                        # General model, using user-supplied PWM (temp/010.pwm.txt)

If you have questions, please contact Qi Liu (qi.liu@vanderbilt.edu) at the Center for Quantitative Sciences, Vanderbilt University Medical Center.

May 9th, 2018

########################
### PREREQUISITES ######
########################

1. R 3.2.2 (higher versions may also work)

2. R package randomForest v4.6-12
   https://cran.r-project.org/web/packages/randomForest/index.html

3. The ViennaRNA Package v2.1.9
   Download the package at http://www.tbi.univie.ac.at/RNA/index.html#download. After decompression and installation, add /PATH/TO/RNAfold/ to your $PATH environment variable.

4. The standalone program bigWigToWig
   Download it at http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigToWig. Add /PATH/TO/bigWigToWig to your $PATH environment variable.

5. NCBI-blast v2.2.31+
   #####################################################################################################################
   ########### YOU MUST CONSTRUCT YOUR OWN blastdb and index, AND MODIFY ONE POINT IN script core_csrv.sh ##############
   #####################################################################################################################
   5.1 Download blastn at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
   5.2 Download the human genome FASTA sequence at http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz and decompress it. The commands below assume the decompressed file is named HG38.fa.
   5.3 Construct blastdb & index
       makeblastdb -dbtype nucl -in HG38.fa -out HG38    ## (HG38.nsq, HG38.nhr, HG38.nin will be created)
       makembindex -input HG38                           ## (HG38.[00|01|02].idx and HG38.shd will be created)
   5.4 Modify the script /ANYWHERE/beRBP/code/core_csrv.sh so that it points to the absolute path of your own blastdb and index (WHERE_REFGENOME_DB_IS):
       blastn -query siteSegmentCore_$i.fa -db /WHERE_REFGENOME_DB_IS/HG38 -task megablast -use_index true -index_name /WHERE_REFGENOME_DB_IS/HG38 -word_size 16

6. hg38.phyloP100way.bw
   This conservation-score file is large (~10G), so it is distributed separately from the rest of the beRBP archive. Download hg38.phyloP100way.bw from http://bioinfo.vanderbilt.edu/beRBP/download.html or http://hgdownload.cse.ucsc.edu/goldenpath/hg38/phyloP100way/hg38.phyloP100way.bw, and place it in the uncompressed directory at beRBP/lib/hg38.phyloP100way.bw.

########################
### HOW TO RUN ME ######
########################

1. Place the beRBP archive in a directory of your choice and decompress it (e.g., tar -xzf beRBP.tgz). Say you put it in /ANYWHERE/; after decompression you get a new directory /ANYWHERE/beRBP. Change your working directory to /ANYWHERE/beRBP ("cd /ANYWHERE/beRBP").
   1.1 Edit the file /ANYWHERE/beRBP/code/core_csrv.sh, specifying your own blastdb and index (see Point 5.4 in PREREQUISITES, above).
   1.2 Ensure that you have the file hg38.phyloP100way.bw and have placed it at beRBP/lib/hg38.phyloP100way.bw.

2. Copy or generate your input FASTA file and put it at /ANYWHERE/beRBP/work/temp/JOBID.fasta. The file must be named in this format: JOBID.fasta. You can replace "JOBID" with any character string of your choice, but note that it is the stem of your FASTA file name and is reused elsewhere in a beRBP run (for example, as the name of the output directory). We provide one example FASTA file as beRBP/work/temp/010.fasta, where "010" is the JOBID.

3. Go to 3.1 for Specific prediction, 3.2 for light-duty General prediction, or 3.3 for heavy-duty General prediction.

3.1 Follow the 3.1.x steps for a Specific prediction.

3.1.1 Pick one or more Specific models. You must choose from these 37 items: {CIRBP, CPEB4, ELAVL1_1, ELAVL1_2, ELAVL1_3, ELAVL1_4, ELAVL1_5, FUS, FXR1, FXR2, HNRNPA1, HNRNPA2B1, HNRNPC, HNRNPF, IGF2BP1, IGF2BP2, IGF2BP3, KHDRBS1, LIN28A_1, LIN28A_2, LIN28B, MSI1_1, MSI1_2, NCL, PABPC1, PCBP2, PUM2, QKI, RBFOX2, RBM47, TAF15, TARDBP, TIA1_1, TIA1_2, U2AF2, ZFP36_1, ZFP36_2}. For example, pick the single RBP CIRBP, or the two RBPs CIRBP and CPEB4.

3.1.2 Run the following commands for a Specific prediction with your selected Specific model(s) (here CIRBP, or CIRBP and CPEB4):

##### Specific prediction: begin #######
cd /ANYWHERE/beRBP/work/
# Ensure that temp/JOBID.fasta exists.
../code/specific.sh JOBID CIRBP >temp/JOBID.log &
## You can change CIRBP to any one from the above 37-RBP list.
## JOBID coincides with the stem of your FASTA file temp/JOBID.fasta
../code/specific.sh JOBID "CIRBP, CPEB4" >temp/JOBID.log &
## You can specify multiple RBPs (separated by ", ") from the above 37-RBP list.
## JOBID coincides with the stem of your FASTA file temp/JOBID.fasta
# You may try with the example FASTA file temp/010.fasta
../code/specific.sh 010 CIRBP >temp/010.log &
../code/specific.sh 010 "CIRBP, CPEB4" >temp/010.log &
##### Specific prediction: end #######

3.2 Follow the 3.2.x steps for a light-duty General prediction based on built-in PWMs.

3.2.1 We have packaged 217 PWMs for 162 human RBPs as beRBP/lib/M*.pwm.txt files. The stem names of those PWMs coincide with cisBP-RNA PWM records (http://cisbp-rna.ccbr.utoronto.ca/). The file beRBP/data/rbp2pwm2pwmLen.txt maps the 217 PWMs to the 162 RBPs. You can predict binding for one or more of these built-in RBPs. For example, pick the RBP "MBNL1" and its PWM "M037_0.6". Alternatively, pick more than one RBP, say "MBNL1, MBNL2" with the PWMs "M037_0.6, M037_0.6".

3.2.2 Run the following commands for a quick General prediction against built-in PWMs:

##### Light-task General prediction: begin #######
cd /ANYWHERE/beRBP/work/
# Ensure that temp/JOBID.fasta exists.
../code/general_sPWM.sh JOBID M037_0.6 MBNL1 &>temp/JOBID.log &
## You can change "M037_0.6" and "MBNL1" to any coupled PWM and RBP taken from the file rbp2pwm2pwmLen.txt
## JOBID coincides with the stem of your FASTA file temp/JOBID.fasta
../code/general_sPWM.sh JOBID "M037_0.6, M037_0.6" "MBNL1, MBNL2" &>temp/JOBID.log &
## You can specify multiple PWMs (separated by ", ") and their associated RBPs from the file rbp2pwm2pwmLen.txt.
## JOBID coincides with the stem of your FASTA file temp/JOBID.fasta
../code/general_sPWM.sh JOBID "all" "all" &>temp/JOBID.log &
## Ask beRBP to test all built-in PWMs of all RBPs on the input sequence(s) in JOBID.fasta.
# You may try with the example FASTA file temp/010.fasta
../code/general_sPWM.sh 010 M037_0.6 MBNL1 &>temp/010.log &
../code/general_sPWM.sh 010 "M037_0.6, M037_0.6" "MBNL1, MBNL2" &>temp/010.log &
../code/general_sPWM.sh 010 "all" "all" &>temp/010.log &    ## This will take some time.
##### Light-task General prediction: end #########

3.3 Run the following commands for a heavy-duty General prediction, if you provide your own PWM as /ANYWHERE/beRBP/work/temp/JOBID.pwm.txt:

##### Heavy-task General prediction: begin #######
cd /ANYWHERE/beRBP/work/
# Ensure that temp/JOBID.fasta and temp/JOBID.pwm.txt exist.
../code/general.sh JOBID >temp/JOBID.log &
# You may try with the example files temp/010.fasta and temp/010.pwm.txt
../code/general.sh 010 >temp/010.log &
##### Heavy-task General prediction: end #######

##########################################
###### OUTPUT and TROUBLESHOOTING ########
##########################################

A successful execution of beRBP generates a directory under /ANYWHERE/beRBP/work/ named after your JOBID:

/ANYWHERE/beRBP/work/JOBID

Within /ANYWHERE/beRBP/work/JOBID, you should see two files and a directory:

resultMatrix.tsv
JOBID.liveORdie
log/

##### resultMatrix.tsv ######
resultMatrix.tsv is a tab-delimited file reporting whether the RBP of interest binds each input FASTA sequence. The seven columns in the table are defined as follows.

seqID       - the sequence ID.
RBP         - the RNA-binding protein.
PWM         - the position weight matrix (accession number as indexed in the cisBP-RNA database, http://cisbp-rna.ccbr.utoronto.ca/).
bind_or_not - beRBP's call on whether the RBP binds the sequence: 1 for binding, 0 for not binding.
sitePos     - the position within the sequence at which the binding site starts.
siteLen     - the length of the binding site. This coincides with the PWM length.
voteFrac    - the fraction of random forest trees that voted for binding.
beRBP makes the bind_or_not call by comparing voteFrac against an optimal threshold learned during training. The threshold differs among Specific models. For the General model, there is a single optimal threshold of 0.358, also learned during training.

##### JOBID.liveORdie #######
JOBID.liveORdie should be an empty file if the job succeeded. Otherwise, it may contain one sentence explaining why your input data was not valid.

##### log/ #######
log/ is a directory containing interim log files.

If the output files are not as described above, the beRBP run has failed. To troubleshoot, check the following files:

beRBP/work/temp/JOBID.log
beRBP/work/JOBID/JOBID.liveORdie

########################################
######### RUNNING TIME #################
########################################

The computation time of beRBP depends on the choice between Specific and General jobs, the number of candidate sequences, and the lengths of the candidate sequences. The following estimates were obtained on a Linux server with 32 8-core CPUs at 2.6 GHz.

1. A Specific prediction or a light-duty General prediction may take a few seconds for one RNA sequence 1.6K nucleotides long.
2. A heavy-duty General prediction may take ~20 minutes for one such RNA sequence.
3. Screening one sequence against all built-in PWMs (217 currently) with general_sPWM.sh takes roughly one hour.
4. For a set of ~1K candidate sequences (median length <10K), a Specific prediction or a light-duty General prediction can finish within an hour, whereas a heavy-duty General prediction may take 4 hours.
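
As an illustration of the checks described under OUTPUT and TROUBLESHOOTING, the shell function below is a minimal sketch and is not part of the beRBP distribution. The function name berbp_summary is hypothetical. It assumes that resultMatrix.tsv begins with a header row and that its seven columns appear in the order listed in that section (bind_or_not as the 4th column); adjust it if your file differs.

```shell
#!/usr/bin/env bash
# Sketch only: summarize a finished beRBP job directory.
# Assumptions (not confirmed by this README):
#   - resultMatrix.tsv has one header row;
#   - columns are seqID, RBP, PWM, bind_or_not, sitePos, siteLen, voteFrac.

berbp_summary() {
    local jobdir="$1"                          # e.g. /ANYWHERE/beRBP/work/010
    local jobid
    jobid="$(basename "$jobdir")"
    local flag="$jobdir/$jobid.liveORdie"
    local matrix="$jobdir/resultMatrix.tsv"

    # A non-empty liveORdie file explains why the input was rejected.
    if [ -s "$flag" ]; then
        echo "job $jobid failed: $(cat "$flag")" >&2
        return 1
    fi

    # No result table at all: point the user at the run log.
    if [ ! -f "$matrix" ]; then
        echo "job $jobid: $matrix not found; check temp/$jobid.log" >&2
        return 1
    fi

    # Skip the header, keep rows with bind_or_not == 1, and print
    # seqID, RBP, PWM, sitePos, siteLen, voteFrac (tab-separated).
    awk 'BEGIN { FS = OFS = "\t" }
         NR > 1 && $4 == 1 { print $1, $2, $3, $5, $6, $7 }' "$matrix"
}
```

For the sample job, this could be called as: berbp_summary /ANYWHERE/beRBP/work/010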