The aim of this notebook is to display a workflow process to:
RarePedSim is used to simulate genotypes for family data using a linear mean shift model with the output of VCF file. (The parameters for RarePedSim are located in a configuration file generated by workflow shown below)
To generate the collapsed haplotype pattern (CHP) regional markers for pedigrees the "collapse" command from rvnpl is used.
The "qnpl" command from rvnpl is used to analyze the QT and CHP regional markers.
The default workflow can be executed by doing the following command:
sos run analysis/QNPL_Simulation.ipynb
Alternatively, simulations can also be performed so that only a proportion of functional rare variants, e.g., 50%, contribute to disease etiology:
sos run analysis/QNPL_Simulation.ipynb --name quant_Prop50 --proportion 0.5
It is also possible to simulate genotypes under the null (no linkage) by setting the mean shift to 0:
sos run analysis/QNPL_Simulation.ipynb --name quant_null --mean_shift 0.0
For the SFS files used here, the variant information was downloaded from gnomAD website. Non-Finnish European (NFE) allele frequencies are used in the minor allele frequency (MAF) column. The value for the "function score" column is based on the functional annotations from gnomAD which determines the type of variant, i.e. synonymous, missense. Each SFS file contains variants from each simulated gene respectively.
[global]
# Disease model scenario
parameter: name = 'quant_Prop100'
# proportion of functional variants that contribute to the disease
parameter: proportion = 'None'
# Mean shift value of detrimental rare variants
parameter: mean_shift = 1.0
# model: LOGIT for qualitative traits or LNR for quantitative traits
parameter: model = 'LNR'
# Path to the ped file (6-column PED format),quantitative trait values are standardized
parameter: ped_file = path('data/100extend_quant.ped')
# Path to list of genes
parameter: gene_list = path('data/genes.txt')
# the output directory for VCF file
parameter: out_dir = path('output')
# gene names
genes = paths([f'{gene_list:d}/{x.strip()}.sfs' for x in open(gene_list).readlines()])
First, we will need to generate the configuration file for the disease model which will be used in the simulation software RarePedSim.
The configuration file serves as the primary input for RarePedSim to generate simulated genotypes for families with QT values. For QTs, the linear mean-shift ("LNR") model is used to model the effect of causal rare variants on the distribution of QT values. The input PED file contains a column of QT values. These QT values were generated by sampling from a $N(2,1)$ distribution for a subset of family members and from a $N(0,1)$ distribution for the remaining family members, then standardized (mean = 0, standard deviation = 1) and saved in the last column of the PED file. Given the QT values in the pedigree(s) and MAF data obtained from gnomAD, genotypes can be simulated. For example:
When genotypes are simulated under the null, rare variants are not associated with QT values, i.e., meanshift_rare_detrimental=0.0. Genotype are generated using MAF obtained from gnomAD assuming Hardy-Weinberg equilibrium and Mendelian segregation, regardless of QT values.
When genotypes are simulated under the alternative, rare variants are assumed to cause an increase in QT values, i.e., meanshift_rare_detrimental is set to >0. For this example, we set meanshift_rare_detrimental=1 so that rare variant within a gene region increase with the QT value. The distribution of rare variants is $N(\sigma M, \sigma^2)$ where $M$ is total count of detrimental rare variants within a gene. Conditional on pre-specified QT values, the distribution of $M$, denoted as $p(M)$ are estimated and the genotypes are assigned based on Mendelian segregation. For formal derivation of this model please refer to Section 3.3 of RarePedSim Documentation.
proportion_causal can be set be set <1, e.g., proportion_causal=0.5. When all variants are causal we set proportion_causal=1.Please refer to the Appendix of RarePedSim Documentation for more details on other parameters used in the configuration file.
[make_config: provides = f'{out_dir}/{name}.conf']
# conf file contains the simulation specifications (either Mendelian or Complex, details in RarePedSim doc)
output: f'{out_dir}/{name}.conf'
report: expand=True, output=_output
trait_type=Complex
[model]
model={model}
[quality control]
def_rare=0.01
rare_only=True
def_neutral=(-1E-5, 1E-5)
def_protective=(-1, -1E-5)
[phenotype parameters]
baseline_effect=0.01
moi=MAV
proportion_causal={proportion}
[LOGIT model]
OR_rare_detrimental=None
OR_rare_protective=None
ORmax_rare_detrimental=None
ORmin_rare_protective=None
OR_common_detrimental=None
OR_common_protective=None
[LNR model]
meanshift_rare_detrimental={mean_shift}
meanshift_rare_protective=None
meanshiftmax_rare_detrimental=None
meanshiftmax_rare_protective=None
meanshift_common_detrimental=None
meanshift_common_protective=None
[genotyping artifact]
missing_low_maf=None
missing_sites=None
missing_calls=None
error_calls=None
[other]
max_vars=2
ascertainment_qualitative=(0,0)
ascertainment_quantitative=((0,~),(0,~))
RarePedSim is used to generate genotypes for families with QT data based on user-specified model in the configuration file.
The output is a VCF file and tabix tool is used to build index in preparation for input to the next step.
[default_1 (RarePedSim simulation of pedigree genotype data)]
depends: f'{out_dir}/{name}.conf'
input: for_each = 'genes'
output: f'{out_dir}/{_genes:bn}.vcf.gz'
bash: container = 'statisticalgenetics/rarepedsim', expand = '${ }'
rm -rf ${_output:nn} ${_output} ${_output}.tbi && mkdir -p ${_output:nn}
rarepedsim generate -s ${_genes:a} -c ${out_dir}/${name}.conf -p ${ped_file:a} --num_genes 1 --num_reps 1 -o ${_output:nn} --vcf -b -1 \
&& mv ${_output:nn}/${_genes:bn}/rep1.vcf ${_output:n} && rm -rf ${_output:nn}
bgzip ${_output:n} && tabix -p vcf ${_output}
RV-NPL determines the CHP regional markers using genotypes from the families. Variants from either simulation studies or from sequence data can be used as input data. Here we are using simulated variants from step default_1. Rare variants with MAF below 0.01 are analyzed to construct the CHP regional markers by setting the parameter -c 0.01. The MAF information column is specified at --freq EVSMAF, since only variants with a MAF<0.01 will be analyzed
[default_2 (Computing CHP markers)]
output: f'{_input:nn}/MERLIN/{_input:bnn}.CHP.ped'
bash: container = 'statisticalgenetics/rvnpl', environment={'HOME': '/seqlink'}, expand = '${ }', stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
rvnpl collapse --fam ${ped_file} --vcf ${_input} --output ${_input:nn} --freq EVSMAF -c 0.01 --rvhaplo \
&& mv ${_output:nn}.chr*.ped ${_output}
RV-QNPL is used to analyze the CHP genotypes and QT values. Unlike NPL, for RV-NPL only sharing of the minor allele is evaluated.
[default_3 (NPL analysis of rare variants for quantitative traits)]
output: f'{_input:dd}/pvalue.txt'
bash: container = 'statisticalgenetics/rvnpl', environment={'HOME': '/seqlink'}, expand = '${ }', stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
rvnpl qnpl --path ${_input:dd} --output ${_output:d} --exact --rvibd --n_jobs 8 -c 0.001 --lower_cut 1E-8 --rep 2000000
The p-values for two QNPL scores are presented in file pvalue.txt in the corresponding gene folder under the output directory.
For RV-QNPL paper we performed analysis of whole genome sequence data obtained on late-onset Alzheimer’s Disease (LOAD) families from the Alzheimer’s Disease Sequencing Project (ADSP). The workflow is very similar to the analysis of simulated data steps default_2 and default_3 -- you may swap the simulated VCF files with real data in VCF format and perform the analysis. Some details to note:
Exported from analysis/QNPL_Simulation.ipynb committed by Gao Wang on Thu May 21 14:50:15 2020 revision 11, ba71536