Reptile Tutorial

Daniel S. Standage
http://standage.public.iastate.edu
Bioinformatics and Computational Biology Graduate Program
Iowa State University

Getting started

The Reptile manuscript references several short read data sets (available at the NCBI sequence read archive) that were used for testing. I searched for, found, and downloaded the .fastq file corresponding to the accession number SRR022918. I then created a directory for this tutorial and placed the .fastq file in that directory.

standage@local:~$ cd
standage@local:~$ mkdir ReptileTest
standage@local:~$ mkdir ReptileTest/data
standage@local:~$ mv SRR022918.fastq ReptileTest/data/.
standage@local:~$ cd ReptileTest
standage@local:~/ReptileTest$ cd ReptileTest

Note: For the purposes of this tutorial, I assume that all programs have been compiled and copied (along with the conversion script) to the user's path. I followed the compilation and installation instructions provided in the README file and have included the terminal output at the end of the tutorial for anyone that may have questions.

Preprocessing data

The first step to error correction with Reptile is getting the data into the correct format. Most short read data these days is stored in .fastq format, so they must be converted to a pair of sequence and quality files, respectively. A conversion script is provided in the source distribution–see the README file for details and the usage statement.

The manuscript said that all reads with ambiguous characters were removed for fair comparison with other software, so I selected that option as well. This is the command I used to convert the data.

standage@local:~/ReptileTest$ perl fastq-converter-v2.0.pl data/ data/ 1
process file: 0: SRR022918.fastq

Total number of ambiguous nucleotides(e.g., N) changed to : 0 

Done!! 7119606 reads remaining; 7289025 discarded; 1 files

standage@local:~/ReptileTest$

This command created two files in the data directory: SRR022918.fa (the sequence file) and SRR022918.q (the quality file). The data are now in the expected format for Reptile.

Parameter tuning

Some Reptile parameters are very dependent on the particular data set being studied. Reptile provides a tool for analyzing the data to determine the optimal values for these parameters. The idea is to do an initial run of the seq analysis tool, observe the distributions of reads and qualities, make any necessary configuration adjustments, and re-run the seq analysis tool. The final results are then used to set Reptile parameters.

First, make a copy of the example config file provided in the source distribution…

standage@local:~/ReptileTest$ cp /usr/local/reptile-v1.1/utils/seq-analy/config-example 
                              seq-analy-config
standage@local:~/ReptileTest$

…and then edit the file to reference the appropriate input files (the '.fa' and '.q' file) and output files (to be generated–choose whichever filename you like). This is what my config file looks like (sans comments).

IFlag                   1
BatchSize               1000000
InFaFile                data/SRR022918.fa
IQFile                  data/SRR022918.q
KmerLen                 24
OKmerHistFile           data/SRR022918.kmerhist
QHistFile               data/SRR022918.qualhist
OKmerCntFile    
MaxErrRate              0.02
QThreshold              73
MaxBadQPerKmer          4
QFlag                   1

Once the config file is ready, you can run the seq analysis program simply by entering this command.

seq-analy seq-analy-config

This command creates two files: a histogram of quality scores and a histogram of tile occurrences (see README for details). The results of this initial analysis are used to update configuration settings and re-run the analysis tool.

  • The README file says to set MaxBadQPerKmer to a value between 4-6 for 35bp reads, or between 6-10 for 100+bp reads. Since these reads are 48bp, I decided to leave this parameter at its default value of 4.
  • The README file says to set KmerLen to the tile length you want to use for Reptile. I am going to use 24, so I also left this parameter at its default value.
  • The README file says to set QThreshold to the quality score qs such that 80% of the reads have a quality value < qs (derived from the quality histogram file–see README for more details). Only 25% of the reads have quality < 72, but 100% of the reads have quality < 73. I therefore left this parameter at its default value as well.

After updating these parameters, it's typical to re-run the seq analysis program. However, since none of the default parameters changed for this analysis, we can simply use these results to choose parameters for Reptile.

Now, make a copy of the example Reptile configuration file provided in the source distribution.

standage@local:~/ReptileTest$ cp /usr/local/reptile-v1.1/src/config-example reptile-config
standage@local:~/ReptileTest$

The README file provides detailed instructions for selecting parameters settings for this config file using the histograms created by the seq analysis tool. I will not reiterate those instructions here, but this is the config file I ended up with (sans comments) when all was said and done.

InFaFile                data/SRR022918.fa
QFlag                   1
IQFile                  data/SRR022918.q
OErrFile                data/SRR022918.errors
BatchSize               1000000 
KmerLen                 12
hd_max                  1
Step                    12
ExpectSearch            16 
T_ratio                 0.5

######## The following parameters need to be tuned to specific dataset ########

QThreshold              73
MaxBadQPerKmer          4
Qlb                     62
T_expGoodCnt            16
T_card                  6

I then ran Reptile using the following command.

reptile-v1.1 reptile-config

This command creates a file (data/SRR022918.errors in my case) containing information about every error identified in the short read data.

Generating corrected sequences

It's nice to know where the errors are in your short read data, but it's more helpful to actually have the corrected sequences for assembly or other downstream uses. The merger program included in the source distribution can be used to generate the corrected sequences. Below is the command I used to generate the corrected data.

reptile_merger data/SRR022918.fa data/SRR022918.errors data/SRR022918.corrected.fa

Appendix: installation

This is my terminal output from installing Reptile on my desktop.

standage@local:~$ sudo mv reptile-v1.1.tar.bz2 /usr/local/
standage@local:~$ cd /usr/local
standage@local:/usr/local$ sudo bunzip2 reptile-v1.1.tar.bz2 
standage@local:/usr/local$ sudo tar xf reptile-v1.1.tar 
standage@local:/usr/local$ cd reptile-v1.1/src/
standage@local:/usr/local/reptile-v1.1/src$ make all
g++ -c -Wall -O3   Kmer.cpp   
g++ -c -Wall -O3   main.cpp   
In file included from main.cpp:33:
Parser.h: In function ‘void print_kcvec(const std::string&, const kcvec_t&, int)’:
Parser.h:114: warning: comparison between signed and unsigned integer expressions
g++ -c -Wall -O3   Parser.cpp   
In file included from Parser.cpp:30:
Parser.h: In function ‘void print_kcvec(const std::string&, const kcvec_t&, int)’:
Parser.h:114: warning: comparison between signed and unsigned integer expressions
Parser.cpp: In member function ‘void Parser::tableMaker(const Para&)’:
Parser.cpp:205: warning: comparison between signed and unsigned integer expressions
Parser.cpp: In member function ‘void Parser::updateRead(std::string&, int)’:
Parser.cpp:350: warning: comparison between signed and unsigned integer expressions
Parser.cpp: In function ‘void candidates(ivec_t&, const kcvec_t&, int)’:
Parser.cpp:369: warning: comparison between signed and unsigned integer expressions
Parser.cpp: In member function ‘bool Parser::errorCall(const upair_t&, const ipair_t&, const ipair_t&, char*, const Para&)’:
Parser.cpp:459: warning: comparison between signed and unsigned integer expressions
Parser.cpp:464: warning: comparison between signed and unsigned integer expressions
Parser.cpp:471: warning: comparison between signed and unsigned integer expressions
Parser.cpp:516: warning: comparison between signed and unsigned integer expressions
Parser.cpp: In function ‘bool checkPoint(const ivec_t&, int)’:
Parser.cpp:531: warning: comparison between signed and unsigned integer expressions
Parser.cpp: In member function ‘void Parser::unitNeighbor(uvec_t&, int, const Para&)’:
Parser.cpp:543: warning: comparison between signed and unsigned integer expressions
Parser.cpp:548: warning: comparison between signed and unsigned integer expressions
Parser.cpp:559: warning: comparison between signed and unsigned integer expressions
Parser.cpp:611: warning: comparison between signed and unsigned integer expressions
Parser.cpp: In function ‘void updateNodes(kcvec_t&, kcvec_t&, const kcvec_t&, int)’:
Parser.cpp:626: warning: comparison between signed and unsigned integer expressions
Parser.cpp:631: warning: comparison between signed and unsigned integer expressions
Parser.cpp:636: warning: comparison between signed and unsigned integer expressions
Parser.cpp: In member function ‘void Parser::mergeTiles(kcvec_t&, kcvec_t&, const Para&)’:
Parser.cpp:656: warning: comparison between signed and unsigned integer expressions
Parser.cpp:680: warning: comparison between signed and unsigned integer expressions
Parser.cpp:696: warning: comparison between signed and unsigned integer expressions
Parser.cpp: In member function ‘void Parser::tiling(kcvec_t&, const uvec_t&, const uvec_t&, const Para&)’:
Parser.cpp:726: warning: comparison between signed and unsigned integer expressions
Parser.cpp:727: warning: comparison between signed and unsigned integer expressions
Parser.cpp: In member function ‘void Parser::output(const std::string&)’:
Parser.cpp:769: warning: comparison between signed and unsigned integer expressions
Parser.cpp:773: warning: comparison between signed and unsigned integer expressions
g++ -c -Wall -O3   Seq.cpp   
g++  Kmer.o main.o Parser.o Seq.o -o reptile-v1.1
standage@local:/usr/local/reptile-v1.1/src$ sudo cp reptile-v1.1 /usr/local/bin/.
standage@local:/usr/local/reptile-v1.1/src$ cd ../utils/
standage@local:/usr/local/reptile-v1.1/utils$ sudo cp fastq-converter-v2.0.pl /usr/local/bin/.
standage@local:/usr/local/reptile-v1.1/utils$ cd seq-analy/
standage@local:/usr/local/reptile-v1.1/utils/seq-analy$ make all
g++ -c -Wall -O3   Analy.cpp   
g++ -c -Wall -O3   Kmer.cpp   
g++ -c -Wall -O3   main.cpp   
g++ -c -Wall -O3   Seq.cpp   
Seq.cpp: In member function ‘void Seq::qualHist(const Para&)’:
Seq.cpp:73: warning: comparison between signed and unsigned integer expressions
g++  Analy.o Kmer.o main.o Seq.o -o seq-analy
standage@local:/usr/local/reptile-v1.1/utils/seq-analy$ sudo cp seq-analy /usr/local/bin/.
standage@local:/usr/local/reptile-v1.1/utils/seq-analy$ cd ../reptile_merger/
standage@local:/usr/local/reptile-v1.1/utils/reptile_merger$ make all
g++ -Wall -I. reptile_merger.cpp -o reptile_merger 
standage@local:/usr/local/reptile-v1.1/utils/reptile_merger$ sudo cp reptile_merger /usr/local/bin/.
standage@local:/usr/local/reptile-v1.1/utils/reptile_merger$
reptile_tutorial.txt · Last modified: 2011/04/04 12:34 by xyang