In this notebook I show the workflow process to simulate family data with genotypes generated conditional on affection status using RarePedSim. And use RV-NPL to analyze the simulated families.
Simulate family data using RarePedSim assuming a multiplicative disease model with the output of VCF file. (The parameters for RarePedSim are located in a configuration file generated by workflow below)
Then use "collapse" command from rvnpl to generate CHP regional markers for the genotypes in pedigrees
Finally, use "npl" command from rvnpl to analyze RV-NPL score on the genotypes for CHP markers
The default workflow can be executed by doing the following command:
sos run analysis/NPL_Simulation.ipynb simulate
Or, we can simulate only a proportion of functional rare variants contribute to the disease (e.g. 50%):
sos run analysis/NPL_Simulation.ipynb simulate --name Prop50 --proportion 0.5
Or, we can simulate genotypes under the null by setting odds ratio (OR)=1:
sos run analysis/NPL_Simulation.ipynb simulate --name null --OR 1.0
*For the SFS files used here, the variant information was downloaded from gnomAD website. The non-Finnish European allele frequency is used for the MAF column. The value for the "function score" column is based on the functional annotations from gnomAD.
[global]
# Disease model scenario
parameter: name = 'Prop100'
# proportion of functional variants that contribute to the disease
parameter: proportion = 'None'
# Odds Ratio
parameter: OR = 5.0
# model: LOGIT for qualitative traits or LNR for quantitative traits
parameter: model = 'LOGIT'
# Path to the ped file (6-column PED format)
parameter: ped_file = path('data/100extend01.ped')
# Path to list of genes
parameter: gene_list = path('data/genes.txt')
# the output directory for VCF file
parameter: out_dir = path('output')
# gene names
genes = paths([f'{gene_list:d}/{x.strip()}.sfs' for x in open(gene_list).readlines()])
First, we will need to generate the configuration file for the disease model which will be used in the simulation software RarePedSim.
[make_config: provides = f'{out_dir}/{name}.conf']
# conf file contains the simulation specifications (either Mendelian or Complex, details in RarePedSim doc)
output: f'{out_dir}/{name}.conf'
report: expand=True, output=_output
trait_type=Complex
[model]
model={model}
[quality control]
def_rare=0.01
rare_only=True
def_neutral=(-1E-5, 1E-5)
def_protective=(-1, -1E-5)
[phenotype parameters]
baseline_effect=0.01
moi=MAV
proportion_causal={proportion}
[LOGIT model]
OR_rare_detrimental={OR}
OR_rare_protective=None
ORmax_rare_detrimental=None
ORmin_rare_protective=None
OR_common_detrimental=None
OR_common_protective=None
[LNR model]
meanshift_rare_detrimental=0.0
meanshift_rare_protective=None
meanshiftmax_rare_detrimental=None
meanshiftmax_rare_protective=None
meanshift_common_detrimental=None
meanshift_common_protective=None
[genotyping artifact]
missing_low_maf=None
missing_sites=None
missing_calls=None
error_calls=None
[other]
max_vars=2
ascertainment_qualitative=(0,0)
ascertainment_quantitative=((0,~),(0,~))
Here, we use RarePedSim to generate genotypes for families with given affection status based on user-specified disease model in the configuration file.
The output file is a VCF file and we need to tabix it before the next step.
[simulate_1 (rarepedsim)]
depends: f'{out_dir}/{name}.conf'
input: for_each = 'genes'
output: f'{out_dir}/{_genes:bn}.vcf.gz'
bash: container = 'statisticalgenetics/rarepedsim', expand = '${ }'
rm -rf ${_output:nn} ${_output} ${_output}.tbi && mkdir -p ${_output:nn}
rarepedsim generate -s ${_genes:a} -c ${out_dir}/${name}.conf -p ${ped_file:a} --num_genes 1 --num_reps 1 -o ${_output:nn} --vcf -b -1 \
&& mv ${_output:nn}/${_genes:bn}/rep1.vcf ${_output:n} && rm -rf ${_output:nn}
bgzip ${_output:n} && tabix -p vcf ${_output}
The next step is to use RV-NPL to generate CHP regional markers for the genotypes in families.
[simulate_2 (CHP)]
output: f'{_input:nn}/MERLIN/{_input:bnn}.CHP.ped'
bash: container = 'statisticalgenetics/rvnpl', environment={'HOME': '/seqlink'}, expand = '${ }', stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
rvnpl collapse --fam ${ped_file} --vcf ${_input} --output ${_input:nn} --freq EVSMAF -c 0.01 --rvhaplo \
&& mv ${_output:nn}.chr*.ped ${_output}
Finally, we use RV-NPL to analyze the CHP genotypes to get the significance of allele-sharing on rare variants in the families.
[simulate_3 (rvnpl)]
output: f'{_input:dd}/pvalue.txt'
bash: container = 'statisticalgenetics/rvnpl', environment={'HOME': '/seqlink'}, expand = '${ }', stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
rvnpl npl --path ${_input:dd} --output ${_output:d} --exact --info_only --perfect --sall --rvibd --n_jobs 8 -c 0.001 --lower_cut 1E-8 --rep 2000000
The p-values for two NPL scores are presented in file pvalue.txt in the corresponding gene folder under the output directory.
Exported from analysis/NPL_Simulation.ipynb committed by Gao Wang on Wed Apr 29 17:57:03 2020 revision 31, 65dc8dc