    cd gav-1.1
    make

Check if the two files approject and casav exist.
Two data sets are provided for testing whether the install is complete.
    cd demo
    path/gav -i demo1.options
    path/gav -i demo2.options
    diff -q validate1/ml-1-1 result1 | grep -v ': 1'
    diff -q validate2/ml-1-1 result2 | grep -v ': 1'

where path is the directory in which the program is installed.
1. Making a feature file for each assembly
As input, the program needs files describing the layout and the sequence mate-pairs for each assembly. As output, it writes a feature table as a TDF file, an example of which is given at the end of this section.
- Layout file format : ACE and WUSTL (Washington University in St. Louis).
In the WUSTL file, a line specifies the layout of a read as shown in the example below

    * AZSH569613.y1 37 982 0 contig_0 scaffold_8303 1 1487

    Column Description
    ------ ----------------------------------------------------------
    1      NCBI ti number for read (or *, if none known)
    2      read name
    3      left trimmed position on the original read
    4      number of bases in trimmed read
    5      orientation on contig (0 = forward, 1 = reverse)
    6      contig name
    7      supercontig name
    8      approximate start position of the trimmed read in the contig
    9      approximate start position of the trimmed read on supercontig

- Mate-pair file format : tab-delimited file (TDF) and XML in NCBI trace archive.
In the TDF file, a line specifies mate-pair information as shown in the example below

    AZSH1000.x1 AZSH1000.y1 3153 562

    Column Description
    ------ ----------------------------------------------------------
    1      read name
    2      read name
    3      expected length of this pair (library)
    4      standard deviation of this pair (library)

*Currently, conversion to these files is supported for assemblies made by Arachne and PCAP.
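The two column descriptions above map directly onto small record types. The following Python sketch is illustrative only and is not part of gav. The names LayoutRead, MatePair, parse_layout_line and parse_mate_pair_line are hypothetical, and the sketch assumes that fields are separated by whitespace, as in the example lines.

```python
from typing import NamedTuple, Optional


class LayoutRead(NamedTuple):
    """One line of a WUSTL layout file (hypothetical record type)."""
    ti: Optional[str]        # column 1: NCBI ti number, None when given as "*"
    read_name: str           # column 2
    left_trim: int           # column 3: left trimmed position on the original read
    trimmed_length: int      # column 4: number of bases in trimmed read
    reverse: bool            # column 5: 0 = forward, 1 = reverse
    contig: str              # column 6
    supercontig: str         # column 7
    contig_start: int        # column 8: approximate start in the contig
    supercontig_start: int   # column 9: approximate start on the supercontig


class MatePair(NamedTuple):
    """One line of a mate-pair TDF file (hypothetical record type)."""
    read_a: str
    read_b: str
    expected_length: float   # column 3: expected length of the pair (library)
    std_dev: float           # column 4: standard deviation of the pair (library)


def parse_layout_line(line: str) -> LayoutRead:
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 columns, got {len(f)}")
    return LayoutRead(
        None if f[0] == "*" else f[0], f[1], int(f[2]), int(f[3]),
        f[4] == "1", f[5], f[6], int(f[7]), int(f[8]),
    )


def parse_mate_pair_line(line: str) -> MatePair:
    f = line.split()
    if len(f) != 4:
        raise ValueError(f"expected 4 columns, got {len(f)}")
    return MatePair(f[0], f[1], float(f[2]), float(f[3]))


read = parse_layout_line("* AZSH569613.y1 37 982 0 contig_0 scaffold_8303 1 1487")
pair = parse_mate_pair_line("AZSH1000.x1\tAZSH1000.y1\t3153\t562")
```

Raising an error on a wrong column count is a choice made for this sketch. How gav itself treats malformed lines is not stated here.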
- Repeat analysis file
The results of RepeatMasker and of self-comparison tools such as BLAST and MUMmer can optionally be used.

    RepeatMasker    *.cat
    BLAST (-m 8)    TDF
    MUMmer          *.delta

The feature table is formatted like below

    #ID       Beg   End   RC    Nclone  Ngood  Nbad  Nlow  Nwrong  GMB   RGB   AZ     ANZ    APZ   Self  Repeat
    contig_0  501   1000  6.11  7.48    7.48   0.00  0.00  0.00    7.48  2.13  -0.33  -0.74  0.76  1.00  Y
    contig_0  1001  1500  3.09  9.47    9.47   0.00  0.00  0.00    9.47  2.35  -0.40  -0.72  0.76  1.00  N

where the first column is ID, and the second and the third columns are the starting and end positions of a block, respectively. The first line must start with # and contain column names.
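Because the first line of the feature table starts with # and carries the column names, the table can be read into one record per block. The Python sketch below is illustrative and not part of gav. The function name read_feature_table is hypothetical, fields are assumed to be whitespace-separated, and all values are kept as strings because the meaning of the individual feature columns is not described here.

```python
import io


def read_feature_table(handle):
    """Return (column_names, rows), where each row is a dict keyed by column name."""
    header = handle.readline().rstrip("\n")
    if not header.startswith("#"):
        raise ValueError("first line must start with # and contain column names")
    columns = header[1:].split()          # drop the leading '#'
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:                    # skip blank lines
            continue
        if len(fields) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
        rows.append(dict(zip(columns, fields)))
    return columns, rows


example = io.StringIO(
    "#ID\tBeg\tEnd\tRC\tNclone\tNgood\tNbad\tNlow\tNwrong\tGMB\tRGB\tAZ\tANZ\tAPZ\tSelf\tRepeat\n"
    "contig_0\t501\t1000\t6.11\t7.48\t7.48\t0.00\t0.00\t0.00\t7.48\t2.13\t-0.33\t-0.74\t0.76\t1.00\tY\n"
    "contig_0\t1001\t1500\t3.09\t9.47\t9.47\t0.00\t0.00\t0.00\t9.47\t2.35\t-0.40\t-0.72\t0.76\t1.00\tN\n"
)
columns, rows = read_feature_table(example)
block_ranges = [(r["ID"], int(r["Beg"]), int(r["End"])) for r in rows]
```

Only ID, Beg and End are converted here, since those are the three columns whose meaning is stated above.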
2. Running machine learning using Weka
As input, a TDF file of known mis-assemblies and correct assemblies is needed, as shown below,

    contig_1 1501 2000 Y
    contig_1 2501 3000 N
    contig_1 6001 6500 Y
    contig_1 6501 7000 N
    contig_1 7001 7500 N

    Column Description
    ------ ----------------------------------------------------------
    1      ID
    2      starting position
    3      end position
    4      classification: Y means mis-assembly and N means correct assembly

and is used to make a feature file for training.
There are three options for sampling from a feature file for training.

    -sm number : sampling of mis-assembly
    -st number : sampling of correct assembly
    -sc number : number of samplings, each of which is used for a separate ML run

If the number given for -sm or -st is less than or equal to 1, it is interpreted as a fraction.
As output, the prediction is stored in a TDF file like below

    #ID       Beg   Known  1  2  3  4  5  6  7  8  9  10
    contig_0  501   ?      N  N  N  N  N  N  N  N  N  N
    contig_0  1001  ?      Y  Y  Y  Y  Y  Y  Y  Y  Y  Y
    contig_0  1501  ?      N  N  N  N  N  N  N  N  N  N

where the first column is ID, the second is the starting position of a block, and the third is a known class. ? means that the class is unknown. The other columns represent the prediction in each sampling.
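The rule stated above for -sm and -st (a number less than or equal to 1 is a fraction, otherwise a count) can be sketched in Python as follows. This is an illustration under stated assumptions, not gav's implementation. The function names are hypothetical, and both the rounding of fractional sizes and the capping of counts at the number of available blocks are assumptions of this sketch.

```python
import random


def resolve_sample_size(value, available):
    """Convert a -sm or -st value into a number of blocks to draw."""
    if value <= 0:
        raise ValueError("sampling value must be positive")
    if value <= 1:
        # Values up to 1 are fractions of the available blocks.
        # Rounding to the nearest integer is an assumption of this sketch.
        return round(value * available)
    # Larger values are absolute counts; capping at the number of
    # available blocks is also an assumption.
    return min(int(value), available)


def sample_blocks(blocks, value, rng):
    """Draw blocks without replacement according to the rule above."""
    return rng.sample(blocks, resolve_sample_size(value, len(blocks)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
known_mis = ["contig_1:1501", "contig_1:6001"]                  # blocks marked Y
known_ok = ["contig_1:2501", "contig_1:6501", "contig_1:7001"]  # blocks marked N
picked_mis = sample_blocks(known_mis, 0.5, rng)  # -sm 0.5: half of the Y blocks
picked_ok = sample_blocks(known_ok, 2, rng)      # -st 2: two of the N blocks
```

Under this reading, -sm 1 selects every known mis-assembly block, because 1 is treated as the fraction 1.0 rather than as a count of one.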
1. CUI-based : PERL script (gav)
As input, the running type should be declared by -1 or -2 for one assembly or two assemblies.
Also, the output directory, known class file, sampling, and classifiers should be declared.

    gav <-1 | -2> -k known -sm 1 -st 1 -sc 1 -c J48 -c RF out_dir

The supported classifiers are J48, RF, RT, NB, BN, SMO, and NN

    J48    decision tree
    RF     random forest
    RT     random tree
    NB     naive Bayes
    BN     Bayesian network
    SMO    support vector machine
    NN     neural network

As input, for the model feature, either both the assembler and the assembly directory or both a layout file and a mate-pair file should be given.
1) If the assembler is Arachne or PCAP, then

    -mi assembly_directory -ma assembler

2) Otherwise

    <-mwu | -mace> layout_file <-mtable | -mxml> mate_pair_file

where either -mwu or -mace is used for the layout file and either -mtable or -mxml for the mate-pair file.
In the same way, except for the prefix of the options, for the test feature either the assembler and the assembly directory or a layout file and a mate-pair file should be given.
1) If the assembler is Arachne or PCAP, then

    -ti assembly_directory -ta assembler

2) Otherwise

    <-twu | -tace> layout_file <-ttable | -txml> mate_pair_file

where either -twu or -tace is used for the layout file and either -ttable or -txml for the mate-pair file.
If you have an option file which contains all the needed options, like below

    Validate
    -mi Assembly -ma Dmoj/Arachne -mr RepeatMasker/dmoj.fa.cat -mm Self/dmoj.delta
    -ti Dvir/Assembly -ta Arachne -tr RepeatMasker/dvir.fa.cat -tm Self/dvir.delta
    -k yyy -sm 0.5 -st 0.5 -sc 100
    -c J48 -c RF
    -2

where line breaks don't matter, then run gav -i option_file.
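The statement that line breaks don't matter in an option file is consistent with treating the file as a flat list of whitespace-separated tokens. The Python sketch below illustrates that reading. It is an assumption about the file's behaviour, not a description of how gav parses it. The function names are hypothetical, and only a subset of the options from the example above is used.

```python
def read_option_tokens(text):
    """Split option-file text on any whitespace, including newlines."""
    return text.split()


def values_of(tokens, flag):
    """Return every value that directly follows the given flag."""
    return [tokens[i + 1] for i, tok in enumerate(tokens[:-1]) if tok == flag]


one_line = "Validate -k yyy -sm 0.5 -st 0.5 -sc 100 -c J48 -c RF -2"
many_lines = """Validate
-k yyy
-sm 0.5 -st 0.5 -sc 100
-c J48
-c RF
-2
"""

tokens = read_option_tokens(many_lines)
same = tokens == read_option_tokens(one_line)   # layout does not change the tokens
classifiers = values_of(tokens, "-c")           # -c may be given more than once
```

Collecting repeated -c values into a list mirrors the usage line above, where -c J48 -c RF selects two classifiers.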
If you already have feature files for the model and the test, give them with -Model feature_file
and -Test feature_file.
*The script keeps feature files for model and test in the files model/model.feature
and test/test.feature in the output directory, respectively.
2. GUI-based : JAVA application (GAV.jar)
The JAVA application interacts with users to set options, run a job, and show the status.
java -jar GAV.jar
    File Name                   Description
    --------------------------- ------------------------------------------------------
    model/model.feature         feature file for model assembly
    test/test.feature           feature file for test assembly
    class.feature               feature file combined by model and known class
    final.feature               feature file for final test by adding the column class
    ml-*-*/weka.*.prediction    prediction files by classifiers with sampling

The last files are formatted like below

    contig_0 501  ? Y
    contig_0 1001 ? N
    contig_0 1501 ? N
    contig_0 6001 n M
    contig_0 6501 ? Y
    contig_0 7001 ? N

    Column  Description
    ------- ------------------------------------------------------
    1       contig ID
    2       start position of a block
    3       known class if given, ? otherwise
    4       predicted classification
            M for model, Y for mis-assembly, N for correct assembly
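A prediction file in the four-column format above can be summarised by counting the predicted classes. The Python sketch below is illustrative and not part of gav. The function names are hypothetical, fields are assumed to be whitespace-separated, and leaving rows marked M (model) out of the count is a choice made for this sketch.

```python
from collections import Counter


def read_predictions(lines):
    """Parse lines of 'contig start known predicted' into tuples."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields:                 # skip blank lines
            continue
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}")
        contig, start, known, predicted = fields
        records.append((contig, int(start), known, predicted))
    return records


def count_predictions(records):
    """Count Y and N predictions, leaving out rows marked M (model)."""
    return Counter(pred for _, _, _, pred in records if pred != "M")


example_lines = [
    "contig_0 501 ? Y",
    "contig_0 1001 ? N",
    "contig_0 1501 ? N",
    "contig_0 6001 n M",
    "contig_0 6501 ? Y",
    "contig_0 7001 ? N",
]
records = read_predictions(example_lines)
counts = count_predictions(records)
mis_assembled_starts = [start for _, start, _, pred in records if pred == "Y"]
```

For the six example rows this yields two blocks predicted as mis-assembly (starting at 501 and 6501) and three predicted as correct assembly.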