A Machine Learning Approach to Combined Evidence Validation of Genome Assemblies

Several in silico measures for assembly validation have been proposed by various researchers. Using three benchmarking Drosophila draft genomes, we evaluate these techniques along with some new measures that we propose, including the good-minus-bad (GMB), the good-to-bad-ratio (RGB), the average Z-score (AZ), and the average absolute Z-score (ASZ). Our results show that the GMB measure performs better than the others in both its sensitivity and its specificity for assembly error detection. Nevertheless, no single method performs sufficiently well to reliably report the genomic regions requiring attention for further experimental verification. To utilize the advantages of all these measures, we have developed a novel machine learning approach that combines these individual measures to achieve a higher prediction accuracy (i.e. greater than 90%). Our combined evidence approach avoids the difficult and often ad hoc selection of many parameters the individual measures require, and significantly improves the overall precisions on the benchmarking data sets.

Contact

Choi, Jeong-Hyeon: jeochoi at indiana dot edu

Citation

Jeong-Hyeon Choi, Sun Kim, Haixu Tang, Justen Andrews, Don G. Gilbert, and John K. Colbourne. "A Machine Learning Approach to Combined Evidence Validation of Genome Assemblies," Bioinformatics, 2008.

Download and Install

After downloading the software package (GAV) and extracting the files with GNU tar, change to the gav-1.1 directory and run make on the command line.
cd gav-1.1
make
Check that the two files approject and casav exist.

Two data sets are provided for testing whether the installation is complete.

cd demo
path/gav -i demo1.options
path/gav -i demo2.options
diff -q validate1/ml-1-1 result1 | grep -v ': 1'
diff -q validate2/ml-1-1 result2 | grep -v ': 1'
where path is the directory in which the program is installed.

Manual

A. Procedure

This software works on either one assembly or two assemblies. It consists of two steps:
(1) make a feature file for each assembly,
(2) run machine learning using Weka.

1. Making a feature file for each assembly
As input, the program needs a layout file and a sequence mate-pair file for each assembly.

- Layout file format : ACE and WUSTL (Washington University in St. Louis).
In the WUSTL file, a line specifies the layout of a read as shown in the example below
* AZSH569613.y1 37 982 0 contig_0 scaffold_8303 1 1487

Column   Description
------   ----------------------------------------------------------
  1      NCBI ti number for read (or *, if none known)
  2      read name
  3      left trimmed position on the original read
  4      number of bases in trimmed read
  5      orientation on contig (0 = forward, 1 = reverse)
  6      contig name
  7      supercontig name
  8      approximate start position of the trimmed read in the contig
  9      approximate start position of the trimmed read on supercontig
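
As a minimal sketch of how one WUSTL layout line could be read, the Python snippet below splits the example line above into the nine columns just described. The names LayoutRead and parse_layout_line are illustrative only and are not part of GAV; the snippet assumes that the fields are separated by whitespace.

```python
from typing import NamedTuple


class LayoutRead(NamedTuple):
    """One read placement, following the nine WUSTL columns (illustrative names)."""
    ti: str                 # NCBI ti number, or "*" if none is known
    read: str               # read name
    left_trim: int          # left trimmed position on the original read
    length: int             # number of bases in the trimmed read
    reverse: bool           # orientation on contig: False = forward (0), True = reverse (1)
    contig: str
    supercontig: str
    contig_start: int       # approximate start of the trimmed read in the contig
    supercontig_start: int  # approximate start of the trimmed read on the supercontig


def parse_layout_line(line: str) -> LayoutRead:
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    return LayoutRead(
        fields[0], fields[1], int(fields[2]), int(fields[3]),
        fields[4] == "1", fields[5], fields[6], int(fields[7]), int(fields[8]),
    )


example = "* AZSH569613.y1 37 982 0 contig_0 scaffold_8303 1 1487"
read = parse_layout_line(example)
print(read.read, read.contig, "reverse" if read.reverse else "forward")
```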

- Mate-pair file format : tab-delimited file (TDF) and XML in NCBI trace archive.
In the TDF file, a line specifies mate-pair information as shown in the example below

AZSH1000.x1     AZSH1000.y1     3153    562

Column   Description
------   ----------------------------------------------------------
  1      read name
  2      read name
  3      expected length of this pair (library)
  4      standard deviation of this pair (library)
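
The following Python sketch reads one mate-pair line of the TDF form shown above. MatePair, parse_mate_pair_line, and z_score are hypothetical names. The z_score helper uses the usual definition (observed minus expected, divided by the standard deviation); whether GAV computes its Z-score measures in exactly this way is not stated here, so treat it as an assumption.

```python
from typing import NamedTuple


class MatePair(NamedTuple):
    read1: str
    read2: str
    expected_length: float  # expected length of this pair (library)
    std_dev: float          # standard deviation of this pair (library)


def parse_mate_pair_line(line: str) -> MatePair:
    # split() accepts tabs as well as runs of spaces between the four columns
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, found {len(fields)}")
    return MatePair(fields[0], fields[1], float(fields[2]), float(fields[3]))


def z_score(pair: MatePair, observed_length: float) -> float:
    # Standard z-score of an observed mate-pair distance (assumed definition)
    return (observed_length - pair.expected_length) / pair.std_dev


pair = parse_mate_pair_line("AZSH1000.x1\tAZSH1000.y1\t3153\t562")
print(pair.read1, pair.read2, z_score(pair, 3715))
```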

*Currently, conversion to these files is supported for assemblies made by Arachne and PCAP.

- Repeat analysis file
Optionally, the results of RepeatMasker and of a self-comparison tool such as BLAST or MUMmer can be used.

RepeatMasker  *.cat
BLAST (-m 8)  TDF
MUMmer        *.delta
As output, the feature table is written as a TDF file like the one below
#ID        Beg    End    RC     Nclone  Ngood  Nbad  Nlow  Nwrong  GMB   RGB   AZ     ANZ    APZ   Self  Repeat
contig_0   501    1000   6.11   7.48    7.48   0.00  0.00  0.00    7.48  2.13  -0.33  -0.74  0.76  1.00  Y
contig_0   1001   1500   3.09   9.47    9.47   0.00  0.00  0.00    9.47  2.35  -0.40  -0.72  0.76  1.00  N
where the first column is ID, and the second and the third columns are the starting and end positions of a block, respectively. The first line must start with # and contain column names.
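
A feature table of this shape could be loaded as sketched below in Python. The function name read_feature_table is hypothetical, and values are kept as strings because the meaning and type of each column are not specified beyond the description above.

```python
import io


def read_feature_table(handle):
    """Return (column_names, rows); each row is a dict keyed by column name."""
    header = None
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if header is None:
            # The first line must start with # and contain the column names
            if not line.startswith("#"):
                raise ValueError("first line must start with # and name the columns")
            header = line[1:].split()
            continue
        values = line.split()
        if len(values) != len(header):
            raise ValueError(f"expected {len(header)} columns, found {len(values)}")
        rows.append(dict(zip(header, values)))
    return header, rows


sample = """#ID Beg End RC Nclone Ngood Nbad Nlow Nwrong GMB RGB AZ ANZ APZ Self Repeat
contig_0 501 1000 6.11 7.48 7.48 0.00 0.00 0.00 7.48 2.13 -0.33 -0.74 0.76 1.00 Y
contig_0 1001 1500 3.09 9.47 9.47 0.00 0.00 0.00 9.47 2.35 -0.40 -0.72 0.76 1.00 N
"""
header, rows = read_feature_table(io.StringIO(sample))
for row in rows:
    print(row["ID"], row["Beg"], row["End"], row["GMB"], row["Repeat"])
```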

2. Running machine learning using Weka
As input, a TDF file listing blocks of known mis-assembly and known correct assembly is needed, as shown below. It is used to make a feature file for training.

contig_1   1501   2000   Y
contig_1   2501   3000   N
contig_1   6001   6500   Y
contig_1   6501   7000   N
contig_1   7001   7500   N

Column   Description
------   ----------------------------------------------------------
  1      ID
  2      starting position
  3      end position
  4      classification: Y means mis-assembly and N means correct assembly
There are three options for sampling from a feature file for training.
-sm number : sampling of mis-assembly
-st number : sampling of correct assembly
-sc number : number of samplings, each of which is used for a separate machine learning run
If the number given for -sm or -st is less than or equal to 1, it is interpreted as a fraction rather than a count.
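
The rule for -sm, -st, and -sc can be illustrated with the Python sketch below, which uses the known class example above. It is not GAV's implementation: the function names, the truncation of fractional sizes, the cap at the number of available blocks, and sampling without replacement are all assumptions made for illustration.

```python
import random

# Known class blocks from the example: (ID, start, end, class)
known = [
    ("contig_1", 1501, 2000, "Y"),
    ("contig_1", 2501, 3000, "N"),
    ("contig_1", 6001, 6500, "Y"),
    ("contig_1", 6501, 7000, "N"),
    ("contig_1", 7001, 7500, "N"),
]


def sample_size(number, available):
    """Read -sm/-st: a value <= 1 is a fraction, a larger value is a count."""
    if number <= 1:
        return int(number * available)  # assumption: fractional sizes are truncated
    return min(int(number), available)  # assumption: counts are capped at what exists


def draw_training_sets(blocks, sm, st, sc, seed=0):
    """Draw sc independent samples of mis-assembled (Y) and correct (N) blocks."""
    rng = random.Random(seed)
    mis = [b for b in blocks if b[3] == "Y"]
    correct = [b for b in blocks if b[3] == "N"]
    rounds = []
    for _ in range(sc):
        chosen = rng.sample(mis, sample_size(sm, len(mis)))
        chosen += rng.sample(correct, sample_size(st, len(correct)))
        rounds.append(chosen)
    return rounds


rounds = draw_training_sets(known, sm=0.5, st=1, sc=3)
for number, chosen in enumerate(rounds, start=1):
    print(number, [(b[1], b[3]) for b in chosen])
```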
As output, the predictions are stored in a TDF file like the one below
#ID        Beg   Known   1  2  3  4  5  6  7  8  9  10
contig_0   501      ?     N  N  N  N  N  N  N  N  N  N
contig_0  1001      ?     Y  Y  Y  Y  Y  Y  Y  Y  Y  Y
contig_0  1501      ?     N  N  N  N  N  N  N  N  N  N
where the first column is ID, the second is the starting position of a block, and the third is a known class. ? means that the class is unknown. The other columns represent the prediction in each sampling.
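
One way a reader might summarize such a prediction table is to compute, for each block, the fraction of samplings that predicted Y. The Python sketch below does this for the example above; the function name mis_assembly_support is hypothetical, and this summary is a suggestion for downstream use rather than something GAV is documented to report.

```python
def mis_assembly_support(text):
    """Map (ID, Beg) to the fraction of samplings that predicted Y."""
    support = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.split()
        contig, beg = fields[0], int(fields[1])
        calls = fields[3:]  # fields[2] is the known class, ? if unknown
        if not calls:
            raise ValueError(f"no predictions for {contig} {beg}")
        support[(contig, beg)] = calls.count("Y") / len(calls)
    return support


prediction_text = """#ID Beg Known 1 2 3 4 5 6 7 8 9 10
contig_0 501 ? N N N N N N N N N N
contig_0 1001 ? Y Y Y Y Y Y Y Y Y Y
contig_0 1501 ? N N N N N N N N N N
"""
support = mis_assembly_support(prediction_text)
for (contig, beg), fraction in sorted(support.items()):
    print(contig, beg, fraction)
```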

B. How to run

The software package has two types of front-end programs based on CUI and GUI.

1. CUI-based : Perl script (gav)
The run type must be declared with -1 (one assembly) or -2 (two assemblies). The output directory, known class file, sampling options, and classifiers must also be given.

gav <-1 | -2>  -k known -sm 1 -st 1 -sc 1 -c J48 -c RF out_dir
The supported classifiers are J48, RF, RT, NB, BN, SMO, and NN
J48  decision tree
RF   random forest
RT   random tree
NB   naive Bayes
BN   Bayesian network
SMO  support vector machine
NN   neural network
For the model feature, give either both the assembler and the assembly directory, or both a layout file and a mate-pair file.
1) If assembler is Arachne or PCAP, then
-mi assembly_directory -ma assembler
2) Otherwise
<-mwu | -mace> layout_file <-mtable | -mxml> mate_pair_file
where either -mwu or -mace is used for the layout file and either -mtable or -mxml for the mate-pair file.
The test feature is specified in the same way, except for the prefix of the options: give either both the assembler and the assembly directory, or both a layout file and a mate-pair file.
1) If assembler is Arachne or PCAP, then
-ti assembly_directory -ta assembler
2) Otherwise
<-twu | -tace> layout_file <-ttable | -txml> mate_pair_file
where either -twu or -tace is used for the layout file and either -ttable or -txml for the mate-pair file.
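
The options described so far can be put together programmatically, for example when many runs are scripted. The Python sketch below only builds the argument list and does not run gav. The function name build_gav_command and the file names layout.wu and mates.tdf are hypothetical, and placing the model and test options just before out_dir is an assumption, since only the option names themselves are documented here.

```python
import shlex

# Classifier codes listed in this manual
SUPPORTED = {
    "J48": "decision tree", "RF": "random forest", "RT": "random tree",
    "NB": "naive Bayes", "BN": "Bayesian network",
    "SMO": "support vector machine", "NN": "neural network",
}


def build_gav_command(two_assemblies, known, out_dir, classifiers,
                      extra=(), sm=1, st=1, sc=1):
    """Build an argument list of the form: gav <-1|-2> -k known ... out_dir."""
    unknown = [c for c in classifiers if c not in SUPPORTED]
    if unknown:
        raise ValueError(f"unsupported classifier(s): {unknown}")
    args = ["gav", "-2" if two_assemblies else "-1", "-k", known,
            "-sm", str(sm), "-st", str(st), "-sc", str(sc)]
    for code in classifiers:
        args += ["-c", code]      # -c is repeated once per classifier
    args += list(extra)           # model/test options (placement is an assumption)
    args.append(out_dir)
    return args


args = build_gav_command(
    False, "known", "out_dir", ["J48", "RF"],
    extra=["-mwu", "layout.wu", "-mtable", "mates.tdf"],  # hypothetical file names
)
print(shlex.join(args))
```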
If you have an option file that contains all the needed options, like the one below
Validate
-mi Dmoj/Assembly
-ma Arachne
-mr RepeatMasker/dmoj.fa.cat
-mm Self/dmoj.delta
-ti Dvir/Assembly
-ta Arachne
-tr RepeatMasker/dvir.fa.cat
-tm Self/dvir.delta
-k yyy
-sm 0.5
-st 0.5
-sc 100
-c J48
-c RF
-2
where line breaks do not matter,
then run gav -i option_file.
If you have an option file for the model feature only, pass it with -model option_file. Similarly, use -test option_file for the test feature.
*The script automatically makes three option files for all, model, and test.
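
Because line breaks do not matter, an option file can be treated as a plain sequence of whitespace-separated words. The Python sketch below reads a shortened version of the option file above under that view. It is only an illustration: the function name parse_option_text is hypothetical, and the rule that every option other than -1 and -2 takes exactly one value is an assumption based on the options listed in this manual, not a description of how the gav script parses the file.

```python
def parse_option_text(text):
    """Split option-file text into (run mode, option values, other words)."""
    tokens = text.split()   # spaces, tabs, and line breaks all separate words
    mode = None
    flags = {}
    positional = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in ("-1", "-2"):
            mode = token                      # run type takes no value
            i += 1
        elif token.startswith("-"):
            if i + 1 >= len(tokens):
                raise ValueError(f"option {token} has no value")
            # Repeated options such as -c are collected in order
            flags.setdefault(token, []).append(tokens[i + 1])
            i += 2
        else:
            positional.append(token)          # e.g. the output directory
            i += 1
    return mode, flags, positional


multi_line = """Validate
-ti Dvir/Assembly
-ta Arachne
-k yyy
-sm 0.5
-st 0.5
-sc 100
-c J48
-c RF
-2
"""
one_line = " ".join(multi_line.split())
mode, flags, positional = parse_option_text(multi_line)
print(mode, positional, flags["-c"])
print(parse_option_text(one_line) == (mode, flags, positional))
```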

If you already have feature files for the model and the test, pass them with -Model feature_file and -Test feature_file.
*The script keeps feature files for model and test in the files model/model.feature and test/test.feature in the output directory, respectively.

2. GUI-based : Java application (GAV.jar)
The Java application interacts with the user to set options, run a job, and show its status.
java -jar GAV.jar

C. Output files

All files are stored in the output directory specified by the user.
File Name                     Description
---------------------------   ------------------------------------------------------
model/model.feature           feature file for model assembly
test/test.feature             feature file for test assembly
class.feature                 feature file combining the model features and the known classes
final.feature                 feature file for the final test, with the class column added
ml-*-*/weka.*.prediction      prediction files by classifiers with sampling
The prediction files are formatted like the example below
contig_0 501   ?  Y
contig_0 1001  ?  N
contig_0 1501  ?  N
contig_0 6001  n  M
contig_0 6501  ?  Y
contig_0 7001  ?  N

Column   Description
-------  ------------------------------------------------------
  1      contig ID
  2      start position of a block
  3      known class if given, ? otherwise
  4      predicted classification
         M for model, Y for mis-assembly, N for correct assembly
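
A prediction file in this four-column format could be post-processed as in the Python sketch below, which lists the blocks predicted as mis-assembled and sets aside the rows marked M. The function name read_predictions is hypothetical, and the example data are the lines shown above.

```python
def read_predictions(text):
    """Return a list of (contig, start, known, predicted) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, found {len(fields)}")
        rows.append((fields[0], int(fields[1]), fields[2], fields[3]))
    return rows


prediction_text = """contig_0 501   ?  Y
contig_0 1001  ?  N
contig_0 1501  ?  N
contig_0 6001  n  M
contig_0 6501  ?  Y
contig_0 7001  ?  N
"""
rows = read_predictions(prediction_text)
# Blocks predicted as mis-assembly (Y)
flagged = [(contig, start) for contig, start, known, pred in rows if pred == "Y"]
# Rows marked M (model) carry no Y/N prediction
model_rows = [(contig, start) for contig, start, known, pred in rows if pred == "M"]
print("flagged:", flagged)
print("model:", model_rows)
```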

Jeong-Hyeon Choi @ The Center for Genomics and Bioinformatics, Indiana University, Bloomington, Indiana, USA
Last updated on 10/7/2007