README for RSEM-EVAL

Bo Li (bli25 at berkeley dot edu)

Introduction

RSEM-EVAL is built off of RSEM. It is a reference-free de novo transcriptome assembly evaluator. This document covers only the features specific to RSEM-EVAL. For the features shared with RSEM, please refer to 'README_RSEM.md'.

Compilation & Installation

To compile RSEM-EVAL, simply run

make

To install, simply put the rsem directory in your environment's PATH variable.

Prerequisites

A C++ compiler, Perl and R must be installed.

To take advantage of RSEM-EVAL's built-in support for the Bowtie alignment program, you must have Bowtie installed.

Usage

I. Build an assembly from the RNA-Seq data using an assembler

Please note that the RNA-Seq data set used to build the assembly should be exactly the same as the RNA-Seq data set for evaluating this assembly.

II. Estimate Transcript Length Distribution Parameters

RSEM-EVAL provides a script, 'rsem-eval-estimate-transcript-length-distribution', to estimate the transcript length distribution from a set of transcript sequences. The transcripts can come from a species closely related to the organism whose transcriptome is sequenced. Its usage is:

rsem-eval-estimate-transcript-length-distribution input.fasta parameter_file

input.fasta is a multi-FASTA file that contains all transcript sequences used to learn the parameters of the true transcript length distribution.
parameter_file records the learned parameters--the estimated mean and standard deviation (separated by a tab character).
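
As a rough illustration of what the parameter file summarizes, the following Python sketch computes the mean and standard deviation of sequence lengths in a multi-FASTA stream. It is not the script's actual implementation: the function names are hypothetical, and the use of the population (rather than sample) standard deviation is an assumption.

```python
import io
import math


def read_fasta_lengths(handle):
    """Yield the length of each sequence in a multi-FASTA stream."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def length_mean_sd(lengths):
    """Return (mean, sd) of the lengths.

    Assumption: population standard deviation (divide by n).
    The real script may use a different estimator.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no sequences found")
    mean = sum(lengths) / len(lengths)
    variance = sum((x - mean) ** 2 for x in lengths) / len(lengths)
    return mean, math.sqrt(variance)


# Tiny made-up input, for illustration only.
example = io.StringIO(">t1\nACGT\nAC\n>t2\nACGTACGTAC\n")
mean, sd = length_mean_sd(read_fasta_lengths(example))
# Mean and standard deviation separated by a tab, as in parameter_file.
print(f"{mean}\t{sd}")
```

For real data, the script itself should be used so that the parameters match what 'rsem-eval-calculate-score' expects.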

We provide some learned parameter files in the folder 'true_transcript_length_distribution'. Please refer to 'true_length_dist_mean_sd.txt' in that folder for more details.

III. Calculate the RSEM-EVAL score

To calculate the RSEM-EVAL score, you should use 'rsem-eval-calculate-score'. Run

rsem-eval-calculate-score --help

to get usage information.

IV. Outputs related to the evaluation score

RSEM-EVAL produces the following three score-related files: 'sample_name.score', 'sample_name.score.isoforms.results' and 'sample_name.score.genes.results'.

'sample_name.score' stores the evaluation score for the evaluated assembly. It contains 14 lines and each line contains a name and a value separated by a tab.

The first 6 lines provide: 'Score', the RSEM-EVAL score; 'BIC_penalty', the BIC penalty term; 'Prior_score_on_contig_lengths', the log score of priors of contig lengths; 'Prior_score_on_contig_sequences', the log score of priors of contig sequence bases; 'Data_likelihood_in_log_space_without_correction', the RSEM log data likelihood calculated with contig-level read generating probabilities mentioned in the supplement text of our DETONATE manuscript; 'Correction_term', the correction term. Score = BIC_penalty + Prior_score_on_contig_lengths + Prior_score_on_contig_sequences + Data_likelihood_in_log_space_without_correction - Correction_term.
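
The relationship between the first six lines can be checked with a short Python sketch. It assumes only what is stated above (one name and one value per line, separated by a tab); the function names are hypothetical and the numbers in the example are invented placeholders, not real RSEM-EVAL output.

```python
import math


def parse_score_file(text):
    """Parse 'name<TAB>value' lines into a dict of floats."""
    values = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, value = line.split("\t")
        values[name] = float(value)
    return values


def recompute_score(values):
    """Recombine the five components as described in the text."""
    return (
        values["BIC_penalty"]
        + values["Prior_score_on_contig_lengths"]
        + values["Prior_score_on_contig_sequences"]
        + values["Data_likelihood_in_log_space_without_correction"]
        - values["Correction_term"]
    )


# Invented placeholder values, for illustration only.
example = "\n".join([
    "Score\t-1055.0",
    "BIC_penalty\t-10.0",
    "Prior_score_on_contig_lengths\t-20.0",
    "Prior_score_on_contig_sequences\t-30.0",
    "Data_likelihood_in_log_space_without_correction\t-1000.0",
    "Correction_term\t-5.0",
])

values = parse_score_file(example)
# The reported score should match the recombined components.
print(math.isclose(values["Score"], recompute_score(values)))
```

On a real 'sample_name.score' file, small differences from rounding in the printed values are possible, so a tolerance-based comparison such as math.isclose is safer than exact equality.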

The next 8 lines provide statistics that may help users understand the RSEM-EVAL score better. They are: 'Number_of_contigs', the number of contigs contained in the assembly; 'Expected_number_of_aligned_reads_given_the_data', the expected number of reads assigned to each contig estimated using the contig-level read generating probabilities mentioned in the supplement text of our DETONATE manuscript; 'Number_of_contigs_smaller_than_expected_read/fragment_length', the number of contigs whose length is smaller than the expected read/fragment length; 'Number_of_contigs_with_no_read_aligned_to', the number of contigs whose expected number of aligned reads is smaller than 0.005; 'Maximum_data_likelihood_in_log_space', the maximum data likelihood in log space calculated from RSEM by treating the assembly as "true" transcripts; 'Number_of_alignable_reads', the number of reads that have at least one alignment found by the aligner (because 'rsem-calculate-expression' uses very loose criteria to find alignments, reads with only low quality alignments may also be counted as alignable reads here); 'Number_of_alignments_in_total', the total number of alignments found by the aligner; 'Transcript_length_distribution_related_factors', the term related to the transcript length distribution in 'Prior_score_on_contig_lengths' (it is the summation of the log c_{\lambda}(\ell) terms mentioned in the 'Calculation of the contig length distribution' subsection of the supplementary text of our DETONATE manuscript).

'sample_name.score.isoforms.results' and 'sample_name.score.genes.results' output "corrected" expression levels based on contig-level read generating probabilities mentioned in the supplement of the DETONATE manuscript. Unlike 'sample_name.isoforms.results' and 'sample_name.genes.results', which are calculated by treating the contigs as true transcripts, calculating 'sample_name.score.isoforms.results' and 'sample_name.score.genes.results' involves first estimating the expected read coverage for each contig and then converting the expected read coverage into contig-level read generating probabilities. This procedure is aware that the provided sequences are contigs and gives better expression estimates for very short contigs. In addition, the 'TPM' field is changed to a 'CPM' field, which stands for contig per million.

For 'sample_name.score.isoforms.results', one additional column, 'contig_impact_score', is added. It gives the contig impact score for each contig, as described in the DETONATE manuscript.
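
The 'contig_impact_score' column can be pulled out with a few lines of Python. This sketch assumes the file is tab-delimited with a header row and that the first column identifies the contig; neither is stated above. The header name 'contig_id', the function name and the example rows are hypothetical.

```python
import csv
import io


def read_contig_impact_scores(handle):
    """Return (contig identifier, impact score) pairs.

    Assumptions: tab-delimited text with a header row, and the
    first column holds the contig identifier.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    fields = reader.fieldnames or []
    if "contig_impact_score" not in fields:
        raise ValueError("no 'contig_impact_score' column found")
    id_column = fields[0]
    return [(row[id_column], float(row["contig_impact_score"]))
            for row in reader]


# Invented rows with a hypothetical header, for illustration only.
example = io.StringIO(
    "contig_id\tlength\tcontig_impact_score\n"
    "c1\t100\t1.5\n"
    "c2\t200\t-0.5\n"
)

scores = read_contig_impact_scores(example)
# List contigs from highest to lowest impact score.
for contig, score in sorted(scores, key=lambda pair: pair[1], reverse=True):
    print(contig, score)
```

Reading the column by name rather than by position keeps the sketch independent of how many other columns the file contains.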

Example

We have a toy example in the folder 'examples'. The true transcript is stored in the file 'toy_ref.fa'. The single-end, 76bp reads generated from this transcript are stored in the file 'toy_SE.fq'. In addition, we have three different assemblies based on the data: 'toy_assembly_1.fa', 'toy_assembly_2.fa' and 'toy_assembly_3.fa'. We also know the true transcript is from mouse and thus use 'mouse.txt' under 'true_transcript_length_distribution' as our transcript length parameter file.

We run

rsem-eval-calculate-score -p 8 \
                          --transcript-length-parameters true_transcript_length_distribution/mouse.txt \
                          examples/toy_SE.fq \
                          examples/toy_assembly_1.fa \
                          toy_assembly_1 \
                          76

to obtain the RSEM-EVAL score.

The RSEM-EVAL score can be found in 'toy_assembly_1.score'. The contig impact scores can be found in 'toy_assembly_1.score.isoforms.results'.
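
To score all three toy assemblies with the same settings, the command line can be assembled programmatically. The Python sketch below only builds and prints the commands, using the same options and argument order as the example above. The function name is hypothetical, and the sample names 'toy_assembly_2' and 'toy_assembly_3' are assumed by analogy with 'toy_assembly_1'.

```python
import shlex


def build_score_command(reads, assembly, sample_name, read_length,
                        threads=8, length_params=None):
    """Build the argument list for one rsem-eval-calculate-score run."""
    cmd = ["rsem-eval-calculate-score", "-p", str(threads)]
    if length_params is not None:
        cmd += ["--transcript-length-parameters", length_params]
    # Positional arguments, in the order used in the example above.
    cmd += [reads, assembly, sample_name, str(read_length)]
    return cmd


for i in (1, 2, 3):
    command = build_score_command(
        reads="examples/toy_SE.fq",
        assembly=f"examples/toy_assembly_{i}.fa",
        sample_name=f"toy_assembly_{i}",  # assumed naming for 2 and 3
        read_length=76,
        length_params="true_transcript_length_distribution/mouse.txt",
    )
    # Print a shell-ready command line; nothing is executed here.
    print(shlex.join(command))
```

Each printed line can be pasted into a shell, or the argument list can be passed to subprocess.run once RSEM-EVAL is on the PATH.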

Authors

RSEM-EVAL is developed by Bo Li, with substantial technical input from Colin Dewey and Nate Fillmore.

Acknowledgements

Please refer to the acknowledgements section in 'README_RSEM.md'.

License

RSEM-EVAL is licensed under the GNU General Public License v3.
