Introduction¶

NRSA (Nascent RNA Sequencing Analysis) is a bioinformatics tool dedicated to analyze nascent transcription profiles generated by PRO-seq and GRO-seq data. NRSA not only quantifies nascent transcription for known genes, but also detects, annotates, and quantifies active enhancers, and predict their targets based on the closest TSS (transcriptional start site), within a distance (default: 50kb) strategy and enhancer-TSS associations from FANTOM5 and 4DGenome if available. In addition, NRSA smoothly integrates other genomic data to prioritize enhancers.

Known genes: Quantify transcriptional pausing and detect condition-dependent transcriptional changes in proximal-promoter and gene body regions.

Active enhancers: Identify, prioritize and annotate active enhancers, predict enhancer targets, and quantify condition-dependent changes of eRNAs.

Currently, NRSA supports five organisms: Human(hg19), Mouse(mm10), D. melanogaster(dm3), C. elegans(ce10) and Zebrafish(danRer10). Please let us know if you want add availability for new organism(s).

A newer version of NRSA is now available at eNRSA

it offers significant performance enhancements and new features, including:

Achieves substantial performance enhancements (~20x faster with markedly reduced memory usage).
Compatible with any species using GTF and FASTA reference files.
Supports multi-factor experimental designs, with adjustments for confounding factors.
Enables identification of alternative transcription start sites (TSS) and termination sites (TTS).
Provides analysis of readthrough transcription dysregulation.

Download¶

Software: NRSA.zip (deprecated, not recommended), NRSA-v2.zip

Hdac3 wild type:
YZ-1.bed
YZ-2.bed

Hdac3 knock-out:
YZ-3.bed
YZ-4.bed

Hdac3 peak file:
Hdac3_ZT10_peaks.narrowPeak

K562 PRO-seq: K562-PROseq.bam

K562 GRO-seq: K562-GROseq.bam

U2OS GRO-seq: GSM1634453.bam; GSM1634455.bam

GM12878 GRO-seq: GM12878-GROseq.bam

Installation¶

1. Requirements: Perl, R, HOMER and bedtools

Install Perl, R, HOMER and bedtools (v2.24.0 or later), and add /bin directory to your executable path.

Perl: https://www.perl.org/
R: http://www.r-project.org/
HOMER: http://homer.salk.edu/homer/introduction/install.html
bedtools: https://github.com/arq5x/bedtools2/tags following the instructions.

2. Install NRSA

unzip NRSA.zip       #Unzip the file
cd NRSA/                  #Change directories into the folder
chmod 755 bin/*.pl        #Change the mode of executable files
chmod 755 bin/*.modules   #Change the mode of executable files
dos2unix bin/*.*

#Add NRSA scripts to Shell searching path ($PATH). This step is optional.
#If your NRSA is installed at /home/usrname/NRSA
export PATH=/home/usrname/NRSA/bin/:$PATH

3. Install required perl packages

#Check whether all the packages needed are installed by running test.modules,
./bin/test.modules

#the output of test.modules looks like,
 ok   Cwd 
 ok   File::Basename
 ok   File::Path
 ok   Getopt::Long
 fail Statistics::Basic 
 fail Statistics::R
 
#If “fail” shows up before the package name, that means users don’t have this required package. In this case, Statistics::Basic and Statistics::R need to be installed. 
Users can install missing packages by running "./bin/install.modules packagename".
#e.g  
./bin/install.modules Statistics::Basic
  
#Add the NRSA lib to your PERL5LIB environment variable,
#If your NRSA is installed at /home/usrname/NRSA
export PERL5LIB=$PERL5LIB:/home/usrname/NRSA/lib/lib/perl5

4. Install required R packages

gplots and DESeq2 (v1.2.10)

#In your R shell, type the following commands:
#gplots
install.packages("gplots")

#DESeq2         		
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")

Input and output¶

1. Input

NRSA takes alignment files in bed or bam format as input. If each condition has more than one samples, they should be given as a space-separated list. It is not required to have the same number of samples for each condition. Spike-in sequences should be removed before starting the analysis.

2. Output

NRSA generates various kinds of results, including nascent RNA abundance in proximal-promoter and gene body regions, pausing index and pausing significance of known genes, pausing index changes across conditions, as well as identified active enhancers and long eRNAs and their quantification. Additionally, NRSA generates heatmap and boxplot figures providing a global view of the data. Users can also use tools provided by NRSA to customize the figures they want to generate.

Enhancer prioritization¶

To identify function important enhancers, NRSA introduced a new algorithm to prioritize enhancers based on the assumption that enhancers bound by the regulator of interest and affecting the transcription of target genes are highly likely to mediate the regulator's function. Integrating GRO/PRO-seq data with ChIP-seq peaks of the regulator, NRSA combines the binding and the functional evidence to rank each enhancer:

Where E_s is the combined score for the enhancer, B_s is the upstream binding effect score and F_s is the downstream functional activity score, w is a weight to balance the relative impact of binding and functional evidence. These three scores for each enhancer are included in the output file Enhancer.txt.

Walkthrough example using PRO-seq data¶

There are two conditions, Hdac3 wild type and Hdac3 knock-out in mouse liver. Each condition has two replicates. Please download them from the Download section and save them in the data/ folder of NRSA.

Hdac3 wild type:

YZ-1.bed
YZ-2.bed

Hdac3 knock-out:

YZ-3.bed
YZ-4.bed

Note: please remove spike-ins before running NRSA if you have.

Besides the following step-by-step guide, you can also download this bash file for running the example. example.sh

Step 1: Quantify nascent RNAs and identify significantly paused genes. If there are two conditions, NRSA also detects the transcriptional changes between conditions. (pause_PROseq.pl)¶

perl ./bin/pause_PROseq.pl -o ./pause_out/ -in1 ./data/YZ-1.bed ./data/YZ-2.bed -in2 ./data/YZ-3.bed ./data/YZ-4.bed -m mm10

BED/BAM files for each condition are separated by space. BED/BAM files for the first condition are specified by -in1, while those for the second condition are specified by -in2, and the output directory is specified by -o (Note that this directory will be used as the work directory in step 2). The organism is defined by -m. hg19, mm10, dm3, ce10 and danRer10 are available, default is hg19. Make sure the genome version is consistent with the one reads were aligned to. The differential analysis is performed on condition 2 vs. 1.
If you have added NRSA scripts to Shell searching path ($PATH), please use the command "pause_PROseq.pl -o ./pause_out/ -in1 ./data/YZ-1.bed ./data/YZ-2.bed -in2 ./data/YZ-3.bed ./data/YZ-4.bed -m mm10" instead.

The main results are put into the folder known-gene/ and intermediate results are into the folder intermediate/.

Main results (known-gene/):

Two files: quantification of transcription for each gene in proximal-promoter and gene body regions

pindex.txt	pausing index and significance for each gene in all samples
normalized_pp_gb.txt	normalized reads count in promoter-proximal and gene body regions for each gene in all samples

Three files: condition-dependent transcription changes for each gene

pp_change.txt	transcription changes of promoter-proximal regions across two conditions
gb_change.txt	transcription changes of gene body regions across two conditions
pindex_change.txt	pausing index changes across two conditions

Seven figures: a global view of the data

boxplot_ppdensity.pdf	box plot of normalized read density of promoter-proximal regions for each sample
boxplot_gbdensity.pdf	box plot of normalized read density of gene body regions for each sample
boxplot_pausingIndex.pdf	box plot of pausing index for each sample
pindex_change.pdf	heatmap of pausing index change across two conditions for genes with adjp<0.05
heatmap.pdf	heatmap of condition-dependent transcription changes around TSS for active genes
Reps-condition1.tif	histogram for transcriptional changes between replicates in condition 1
Reps-condition2.tif	histogram for transcriptional changes between replicates in condition 2

The figures Reps_condition1/2.tif provide some hint on whether normalization works or not since there should be no global transcriptional changes between replicates.

Intermediate results that users might be interested in (intermediate/):

(samplename)_raw.txt	raw counts, density, summit and pausing index for each sample
(samplename)_FDR.txt	same to (samplename)_raw.txt, but adding significant information of pausing index
count_pp_gb.txt	raw counts for all samples
active_gene.txt	active gene list
nf.txt	normalization factors of each sample

Step 2: Identify, quantify, annotate, prioritize, and detect transcriptional changes of active enhancers and long eRNAs.¶

perl ./bin/eRNA.pl -w ./pause_out/ -in1 ./data/YZ-1.bed ./data/YZ-2.bed -in2 ./data/YZ-3.bed ./data/YZ-4.bed -m mm10 -pri 1 -dir 0 -peak ../Hdac3_ZT10_peaks.narrowPeak -lk pp -cf 0.001

BED/BAM files for each condition are separated by space. Data for the first condition are specified by -in1, while those for the second condition are specified by -in2 (the input files should be the same with step 1), and the organism is defined by -m. Here, -w is used to provide the work directory of step 2 (it should be the same as the output directory of step 1. Even if users only want to run enhancer analysis, the first step should be run first to generate information required for this step).
If you have added NRSA scripts to Shell searching path ($PATH), please use the command "eRNA.pl -w ./pause_out/ -in1 ./data/YZ-1.bed ./data/YZ-2.bed -in2 ./data/YZ-3.bed ./data/YZ-4.bed -m mm10 -pri 1 -dir 0 -peak ../Hdac3_ZT10_peaks.narrowPeak -lk pp -cf 0.001" instead.

The outputs of step 2 are put into two folders: eRNA/ and intermediate/. Here the intermediate/ folder is the same one of step 1.

Main results (eRNA/):

Three files: identification/quantification for active enhancers

Enhancer.txt	list of identified enhancers with annotation, predicted target genes from different strategies, and rank scores
Enhancer_center.txt	list of enhancer centers
normalized_count_enhancer.txt	normalized counts for each enhancer

One file: transcriptional change for novel active enhancers

Enhancer_change.txt

transcription change of enhancers

One figure: novel active enhancers

signal_around_ehancer-center.pdf

PRO-seq signal around enhancer center for all samples

Three files: identification/ quantification files for long eRNAs

long_eRNA.txt	indentified long eRNAs (default: length > 10 Kb)
longeRNA-pindex.txt	pausing information of long eRNAs for all samples
longeRNA-normalized_pp_gb.txt	normalized reads count in promoter-proximal and gene body regions of long eRNAs

Three files: transcriptional changes for long eRNAs

longeRNA-pp_change.txt	transcription changes of promoter-proximal regions of long eRNAs across two conditions
longeRNA-gb_change.txt	transcription changes of gene body regions of long eRNAs across two conditions
longeRNA-pindex_change.txt	pausing index changes of long eRNAs across two conditions

Intermediate results (intermediate/):

(samplename)_longeRNA_raw.txt	raw counts, density, summit and pausing index of long eRNAs for each sample, one file per sample
(samplename)_longeRNA_FDR.txt	same to (samplename)_raw.txt, but adding significant information of pausing index
count_enhancer.txt	raw counts of enhancers for all samples
longeRNA-count_pp_gb.txt	raw counts or long eRNAs for all samples
PROseq_signal.txt	raw signal around enhancer center for plot (not normalized)

Basic usage¶

If you added NRSA scripts to your Shell searching path ($PATH), please omit "perl" for the following commands. e.g. for pause_PROseq.pl, please use "pause_PROseq.pl options...." instead of "perl pause_PROseq.pl options....".

pause_PROseq.pl¶

Usage: perl pause_PROseq.pl [options] -in1 bed/bam files -in2 bed/bam files
e.g: perl pause_PROseq.pl -o pause_out/ -in1 control1.bed control2.bed -in2 case1.bed case2.bed

`-m [string]`	defines the genome: hg19, mm10, dm3, ce10, or danRer10, default: hg19 (Make sure the genome version is consistent with the one the reads were aligned to.)
`-in1 [bed/bam]`	required, read alignment files in bed (6 columns) or bam format for condition1, each file is separated by space
`-in2 [bed/bam]`	read alignment files in bed (6 columns) or bam format for condition2, each file is separated by space (It is NOT required to have the same number of samples for each condition. The differential analysis is performed on condition 2 vs. 1.)
`-f1 [string]`	normalization factors for samples of condition1, separated by space. If not specified, we will use DESeq2 default method to normalize
`-f2 [string]`	normalization factors for samples of condition2, same to `-f1`
`-o [string]`	required, output/work directory
`-u [int]`	defines the upstream of TSS as promoter (bp, default: 500)
`-d [int]`	defines the downstream of TSS as promoter (bp, default: 500)
`-b [int]`	defines the start of gene body density calculation (bp, default: 1000)
`-l [int]`	defines the minimum length of a gene to perform the analysis (bp, default: 1000). Genes with length less than this will be ignored.
`-w [int]`	defines the window size (bp, default: 50)
`-s [int]`	defines the step size (bp, default: 5)
`-h`	help message

Some abbreviation in outputs,

Column ID	Description
Transcript	Transcript ID
Gene	Gene symbol
ppc	reads count in promoter-proximal region
ppm	mappable sites in promoter-proximal region
ppd	reads density of promoter-proximal region
pps	the site position with highest reads count in promoter-proximal region (pausing site)
gbc	reads count in gene body region
gbm	mappable sites in gene body region
gbd	reads density of gene body region
pauseIndex	pausing index

eRNA.pl¶

Usage: perl eRNA.pl [options] -in1 bed/bam files -in2 bed/bam files
e.g: perl eRNA.pl -w pause_out/ -in1 control1.bed control2.bed -in2 case1.bed case2.bed

`-m [string]`	defines the genome: hg19, mm10, dm3, ce10, or danRer10, default: hg19 (Make sure the genome version is consistent with the one reads were aligned to.)
`-c [int]`	distance cutoff for divergent transcripts for enhancer detection (bp, default: 400)
`-d [int]`	distance within which two eRNAs are merged (bp, default: 500)
`-w [string]`	required, working directory of eRNA.pl, should be the same of pause_PROseq.pl’s output directory
`-le [int]`	length cutoff for long-eRNA identification (bp, default: 10000)
`-in1 [bed/bam]`	required, read alignment files in bed (6 columns) or bam format for condition1, each file is separated by space
`-in2 [bed/bam]`	read alignment files in bed (6 columns) or bam format for condition2, each file is separated by space (same files should be used for pause_PROseq.pl and eRNA.pl)
`-f [int]`	whether to filter for enhancers: 0 (not filter) or 1 (filter), deflault: 1
`-dtss [int]`	if filter enhancers, the minimum distance from enhancer to TSS (bp, default: 2000)
`-dtts [int]`	if filter enhancers, the minimum distance from enhancer to TTS (bp, default: 20000)
`-wd [int]`	distance within which associate active genes of enhancer (bp, default: 50000)
`-pri [int]`	prioritize enhancer or not. 1: yes; 0: no (default: 0)
`-peak [string]`	required if -pri set to 1. ChIP-seq peak file for the regulator of interest in bed format
`-lk [string]`	valid when -pri set to 1 for two conditions. Transcriptional changes to be used for enhancer prioritization, the value can be pp (changes in promoter-proximal region), gb (changes in gene body region), or pindex (changes in pausing index) (default: gb)
`-dir [int]`	valid when -pri set to 1 for two conditions. The expected simultaneous change of expression between enhancers and target genes. 1: the expression changes of enhancers and target genes are in the same direction; -1: the expression changes of enhancers and target genes are in the opposite direction; 0: the expression changes of enhancers and target genes are either in the same or in the opposite direction (default: 0)
`-wt [real]`	valid when -pri set to 1 for two conditions. Weight to balance the impact of binding and function evidence. The higher the weight, the bigger impact does the binding evidence have (default: 0.5, i.e., binding and functional evidence have equal impact on enhancer prioritization)
`-cf [real]`	valid when -pri set to 1 for two conditions. Cutoff to select significant transcriptional changes (default: FDR < 0.05). Use Foldchange instead if no FDR is generated (default: Foldchange > 1.5)
`-h`	help message

Some abbreviation in outputs,

Column ID	Description
chr	chromosome location of enhancers
start	the start position of enhancers
end	the end position of enhancers
center	the location of enhancer centers
FANTOM5	whether enhancer centers exist in FANTOM5
associated_gene-FANTOM5	associated genes in FANTOM5
associated_gene-50kb	associated genes within 50 Kb distance
associated_gene-4DGenome	associated genes in 4DGenome
Cell/Tissue	Cell/Tissue associated genes in 4DGenome detected in
detection_method	detection method associated genes in 4DGenome
PMID	PMID corresponding associated genes in 4DGenome
closest_gene	the closest gene to each enhancer
distance	the distance from the closest gene to the enhancer
Fscore	enhancer prioritizing score for downstream functional activity
Bscore	enhancer prioritizing score for upstream binding effect
Score	combined enhancer prioritizing score

Additional tools¶

boxplot.pl¶

It generates three boxplots: reads density in promoter-proximal regions, reads density in gene body regions, and pausing index for each sample.

Usage: perl boxplot.pl options
e.g: perl boxplot.pl -i genes_for_boxplot.txt -w pausing_result/ -n boxplot -rep1 2 -rep2 2

`-i [string]`	required, file with a list of genes (RefSeq id like, NM_001005277) user wants to plot
`-w [string]`	required, work directory, should be the same of pause_PROseq.pl’s output/work directory
`-n [string]`	required, prefix name of output figures
`-rep1 [int]`	required, the number of replicates in the first condition
`-rep2 [int]`	the number of replicates in the second condition
`-h`	help message

heatmap.pl¶

It generates a heatmap for transcription changes in condition2 vs condition1 around TSS for genes of interest.

Usage: perl heatmap.pl options
e.g: perl heatmap.pl -i genes_for_heatmap.txt -w pausing_result/ -n heatmap -in1 control1.bed control2.bed -in2 case1.bed case2.bed

`-i [string]`	required, file with a list of genes (RefSeq id like, NM_001005277) user wants to plot
`-m [string]`	define the genome: hg19, mm10, dm3, ce10, or danRer10, default: hg19
`-in1 [bed/bam]`	required, read alignment files in bed/bam format for condition1, each file is separated by space
`-in2 [bed/bam]`	read alignment files in bed/bam format for condition2, each file is separated by space
`-r [int]`	upstream and downstream distance relative to TSS for plotting PRO-seq signal (bp, default: 5000), should can be divided by the bin size
`-b [int]`	bin size (bp, default: 200), should can be divided by 2
`-w [string]`	required, work directory, should be the same of pause_PROseq.pl’s output/work directory
`-n [string]`	required, name of output figure
`-h`	help message

pindex_change-plot.pl¶

It generates a figure for pausing index changes for genes of interest in condition2 vs condition1.

Usage: perl pindex_change-plot.pl options
e.g: perl pindex_change-plot.pl -w pausing_result/ -n pindex-change -cutajp 0.001 -cutfc 2

`-w [string]`	required, work directory, should be the same of pause_PROseq.pl’s output/work directory
`-n [string]`	required, name of output figure
`-cutajp [value]`	cutoff of adjusted pvalue for seleting genes, default "0.001"
`-cutfc [value]`	cutoff of absolute value of log2 fold change for selecting genes, default "0"
`-h`	help message

signal.pl¶

This tool smooths the integration of PRO/GRO-seq results with other genomic data. It generates a histogram showing the PRO/GRO-seq signals or other sequencing data (e.g., binding or histone modification signals from ChIP-seq data) around identified enhancers or regions of interest.

Usage: perl signal.pl options
perl signal.pl -i enhancer_center.txt -w pause_out/ -n signal -in1 control1.bed control2.bed -in2 case1.bed case2.bed

`-i [string]`	required, enhancer center position file (Enhancer_center.txt-- the output of eRNA.pl), or the file of chromation location of interest (should contains at least two columns separated by tab or space, e.g "chr11 83078317")
`-d [int]`	up and down relative distance from the center for plotting PRO-seq or other signal (bp, default: 2000), should can be divided by bin size
`-b [int]`	bin size for smoothing (bp, default: 20), must be even number
`-in1 [bed/bam]`	required, read alignment files in bed/bam format for condition1, each file is separated by space
`-in2 [bed/bam]`	read alignment files in bed/bam format for condition2, each file is separated by space
`-w [string]`	required, work directory, should be the same of pause_PROseq.pl’s output directory
`-n [string]`	required, name of output figure
`-s [char]`	option to generate PRO/GRO-seq signal (p), or signal for other data such as histone modification (o) (default: p) e.g. generates PRO-seq signal around enhancer centers `perl signal.pl -i enhancer_center.txt -s p -w pause_out/ -n PROseq-signal -in1 YZ-1.bed YZ-2.bed -in2 YZ-3.bed YZ-4.bed` e.g. generates H3K4me1 signal around enhancer centers `perl signal.pl -i enhancer_center.txt -s o -w pause_out/ -n H3K4me1-signal -in1 H3K4me1-rep1.bam H3K4me1-rep2.bam`
`-h`	help message

Reference¶

To cite NRSA, please refer to our publication in BMC Genomics: Nascent RNA sequencing analysis provides insights into enhancer-mediated gene regulation. (PMID:30139328).

Contacts¶

Wang, Jing <jing.wang@vanderbilt.edu>
Liu, Qi <qi.liu@vanderbilt.edu>