Command line usage with standalone executables
The command line pipeline for the DIYABC-RF inference framework is based on two command-line programs: diyabc and abcranger.
Please visit the following pages to get the latest releases:
DIYABC-RF pipeline
The DIYABC-RF pipeline includes two steps:
- Training set simulations (cf. dedicated section)
- Random forest analysis (cf. dedicated section)
You can also use the DIYABC-RF simulation engine to generate synthetic data files for your benchmark or evaluation experiments (cf. dedicated section).
Input data and configuration file
Please check the documentation and the toy examples for the supported input data types (SNP, Poolseq, Microsat/Sequence) and for how to write the corresponding header.txt configuration file.
You can also check this repository, and especially this README, which contain the data and configuration files related to data analyses made with DIYABC-RF.
A DIYABC-RF project is a directory containing:
- your data file
- DIYABC-RF configuration files, including header files (header.txt or headerRF.txt)
- any files generated during the pipeline run, including the random seed file (RNG_state_0000.bin), the training set files (reftableRF.bin, statobsRF.txt) and other result files (cf. below).
From now on, we assume that we are working from a DIYABC-RF project directory.
Training set simulation
To simulate the training set for the random forest (based on your data set), you need the diyabc-RF program for your OS from here, named diyabc-RF-<os>-<version> (where <os> is your OS and <version> is the software version, e.g. diyabc-RF-linux-v1.1.27).
Then you need to write a header.txt (or headerRF.txt) configuration file describing your project configuration (cf. previous section).
You can also use the GUI to generate the header.txt file.
From your project directory (containing at least your data file and your header file), you can do the following:
- Generate the seed for random data generation:
diyabc-RF-<os>-<version> -p ./ -n "t:<n_core>"
- Training set simulation:
diyabc-RF-<os>-<version> -p ./ -R "ALL" -m -t <n_core> -g <batch_size> -r <n_simu>
- (Optional) Prior and scenario checking:
diyabc-RF-<os>-<version> -p ./ -d a:pl
Options:
- <n_core> (integer): number of cores used to run the program in parallel mode (parallel computation is highly recommended to reduce the simulation time).
- <batch_size> (integer): number of particles simulated in a single batch (loop-size), default is 100.
- <n_simu> (integer): final size of the training set.
Note: you can increase the size of an existing training set by running the program again with a larger training set size (<n_simu>).
You can run diyabc-RF-<os>-<version> --help for more details about the program options and input parameters.
The prior and scenario checking step projects the observed data set and the simulated training set through a PCA, to give insight into the consistency between your data set and the data simulated under the historical models and parameter priors that you chose. Results can be found in the files prefixed with pcaloc.
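The three steps above can be chained in a small wrapper. The following is a minimal sketch, not part of DIYABC-RF: the names DIYABC_BIN, N_CORE, BATCH_SIZE, N_SIMU, DRY_RUN, run and simulate_training_set are hypothetical, and the executable is assumed to be reachable through your PATH. By default the sketch only prints the commands it would run.

```shell
# Sketch of the training set steps as one function (hypothetical names).
DIYABC_BIN="${DIYABC_BIN:-diyabc-RF-linux-v1.1.27}"  # example name from the text
N_CORE="${N_CORE:-8}"            # number of cores
BATCH_SIZE="${BATCH_SIZE:-100}"  # particles per batch (documented default)
N_SIMU="${N_SIMU:-2000}"         # final training set size
DRY_RUN="${DRY_RUN:-1}"          # 1: print commands, anything else: run them

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Seed generation, training set simulation, then prior and scenario checking.
# Each step only starts if the previous one succeeded.
simulate_training_set() {
    run "$DIYABC_BIN" -p ./ -n "t:$N_CORE" &&
    run "$DIYABC_BIN" -p ./ -R "ALL" -m -t "$N_CORE" -g "$BATCH_SIZE" -r "$N_SIMU" &&
    run "$DIYABC_BIN" -p ./ -d a:pl
}

simulate_training_set
```

Setting DRY_RUN=0 before calling simulate_training_set from the project directory would execute the commands instead of printing them.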
Random forest analysis
With the random forest analysis, based on the training set, you can either run a model selection procedure, or estimate a parameter (or a combination of parameters) from a given model.
To run the random forest analysis, you need the abcranger program for your OS, available from here, named abcranger-<os>-<version> (where <os> is your OS and <version> is the software version, e.g. abcranger-linux-v1.2.64).
You can run abcranger-<os>-<version> --help for more details about the program options and input parameters.
Requirements
To run the following analyses, you will need to work from a complete project directory containing at least:
- your data file
- a headerRF.txt file (training set simulation configuration, generated by the diyabc-RF program)
- a reftableRF.bin file (training set simulated data, generated by the diyabc-RF program)
- a statobsRF.txt file (the values of the summary statistics computed on your data file, generated by the diyabc-RF program)
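Before launching an analysis, it can help to verify that the files listed above are present. The helper below is a minimal sketch under that assumption; require_files is a hypothetical name and not part of diyabc or abcranger. It takes the project directory followed by any number of file names, so it makes no assumption about the names themselves.

```shell
# Sketch: report which required files are missing from a project directory.
# require_files is a hypothetical helper, not part of diyabc or abcranger.
require_files() {
    rf_dir="$1"
    shift
    rf_missing=0
    for rf_name in "$@"; do
        if [ ! -f "$rf_dir/$rf_name" ]; then
            # Report every missing file instead of stopping at the first one.
            echo "missing: $rf_dir/$rf_name" >&2
            rf_missing=1
        fi
    done
    # Exit status 0 when all files exist, 1 otherwise.
    return "$rf_missing"
}
```

From the project directory, it could be called as require_files . followed by the names of your data file and of the files listed above.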
Model selection
To run the model selection procedure, you can run the following command from your project directory:
abcranger-<os>-<version> -t <n_tree> -j <n_core>
Options:
- <n_tree> (integer): number of trees in the random forest.
- <n_core> (integer): number of cores used to run the program in parallel mode (parallel computation is highly recommended to reduce the computation time).
Output: results are written in a collection of files prefixed with modelchoice_out. in the project directory.
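Since the results are spread over several files sharing a prefix, a quick way to confirm that a run produced output is to list the files matching that prefix. This is a minimal sketch; list_results is a hypothetical helper and nothing is assumed about the individual file names beyond the prefix.

```shell
# Sketch: list the files in a directory whose name starts with a given prefix.
# list_results is a hypothetical helper, not part of abcranger.
list_results() {
    lr_prefix="$1"
    lr_dir="${2:-.}"
    lr_count=0
    for lr_file in "$lr_dir/$lr_prefix"*; do
        # An unmatched glob stays literal, so check that the file exists.
        if [ -e "$lr_file" ]; then
            echo "${lr_file##*/}"
            lr_count=$((lr_count + 1))
        fi
    done
    # Succeed only if at least one result file was found.
    [ "$lr_count" -gt 0 ]
}
```

For the model selection step, list_results modelchoice_out. run from the project directory prints the matching file names and returns a non-zero status if there are none.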
Parameter estimation
To run the parameter estimation procedure, you can run the following command from your project directory:
abcranger-<os>-<version> -t <n_tree> -j <n_core> --parameter <param_name> --chosenscen <id>
Options:
- <n_tree> (integer): number of trees in the random forest.
- <n_core> (integer): number of cores used to run the program in parallel mode (parallel computation is highly recommended to reduce the computation time).
- <param_name> (alphanumeric string): name of the parameter to estimate in the model corresponding to the chosen scenario.
- <id> (integer): index of the scenario (starting from 1), among the models listed in the header.txt or headerRF.txt file, from which you want to estimate a parameter.
Output: results are written in a collection of files prefixed with estimparam_out. in the project directory.
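The command above can be assembled by a small function that also checks the scenario index, which must be an integer starting from 1. This is a minimal sketch: ABCRANGER_BIN and estimparam_cmd are hypothetical names, the default tree and core counts are arbitrary illustration values, and the function only prints the command line rather than running it.

```shell
# Sketch: build the parameter estimation command line (hypothetical names).
ABCRANGER_BIN="${ABCRANGER_BIN:-abcranger-linux-v1.2.64}"  # example name from the text

# Usage: estimparam_cmd <param_name> <id> [n_tree] [n_core]
estimparam_cmd() {
    ep_param="$1"
    ep_scen="$2"
    ep_trees="${3:-1000}"  # illustrative default, not an abcranger default
    ep_cores="${4:-8}"     # illustrative default, not an abcranger default
    # Reject empty values, non-digits, and anything starting with 0.
    case "$ep_scen" in
        ''|*[!0-9]*|0*)
            echo "scenario index must be an integer starting from 1" >&2
            return 1
            ;;
    esac
    echo "$ABCRANGER_BIN -t $ep_trees -j $ep_cores --parameter $ep_param --chosenscen $ep_scen"
}

estimparam_cmd N1 1
```

Printing the command first makes it easy to review it before running it from the project directory.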
Example
Here is an example of an analysis.
# init data generation seeds
diyabc -p ./ -n "t:8"
# training set simulations
diyabc -p ./ -R "ALL" -m -g 50 -r 2000 -t 8
# prior checking
diyabc -p ./ -d a:pl
# parameter estimation
abcranger -t 1000 -j 8 --parameter N1 --chosenscen 1 --noob 50
# model choice
abcranger -t 1000 -j 8
Synthetic data generation
You can also use the diyabc program to generate synthetic data for your benchmark or evaluation experiments.
You need to write a headersim.txt configuration file describing your data and the model that will be used to simulate them.
You can also use the GUI to generate the headersim.txt file.
To generate the data file, you can run the following commands from your project directory:
# init data generation seeds
diyabc -p ./ -n "t:8"
# synthetic data generation
diyabc -p ./ -k
Here are some examples of headersim.txt files:
- SNP data:
SNP_sim_dataset_4POP 1 0.5
4 samples
5 5
5 5
5 5
5 5
scenario (8)
N1 N2 N3 N4
0 sample 1
0 sample 2
0 sample 3
0 sample 4
t431 split 4 1 3 ra
t32 merge 2 3
t21 merge 1 2
historical parameters (8)
N1 N 7000
N2 N 2000
N3 N 4000
N4 N 3000
t431 T 200
ra A 0.3
t32 T 300
t21 T 500
loci description (100)
100 <A> [P] G1
- SNP poolseq data:
poolseq_sim_dataset_4POP_cov100 1 0.5
4 samples
50 50
50 50
50 50
50 50
scenario (8)
N1 N2 N3 N4
0 sample 1
0 sample 2
0 sample 3
0 sample 4
t431 split 4 1 3 ra
t32 merge 2 3
t21 merge 1 2
historical parameters (8)
N1 N 7000
N2 N 2000
N3 N 4000
N4 N 3000
t431 T 200
ra A 0.3
t32 T 300
t21 T 500
loci description (100)
100 <A> [P] G1