Evaluation of RV-QNPL Through Simulation Studies¶

Aim¶

The aim of this notebook is to display a workflow process to:

simulate genotype data for families, conditional on their quantitative trait (QT) values using RarePedSim
analysis of the simulated data using RV-QNPL to evaluate type I error and power.

Method & Workflow overview¶

RarePedSim is used to simulate genotypes for family data using a linear mean shift model with the output of VCF file. (The parameters for RarePedSim are located in a configuration file generated by workflow shown below)
To generate the collapsed haplotype pattern (CHP) regional markers for pedigrees the "collapse" command from rvnpl is used.
The "qnpl" command from rvnpl is used to analyze the QT and CHP regional markers.

The default workflow can be executed by doing the following command:

sos run analysis/QNPL_Simulation.ipynb

Alternatively, simulations can also be performed so that only a proportion of functional rare variants, e.g., 50%, contribute to disease etiology:

sos run analysis/QNPL_Simulation.ipynb --name quant_Prop50 --proportion 0.5

It is also possible to simulate genotypes under the null (no linkage) by setting the mean shift to 0:

sos run analysis/QNPL_Simulation.ipynb --name quant_null --mean_shift 0.0

Input Data¶

The PED file contains the pedigree structures as well as QT values.
The SFS file containing the variant information for each simulated gene (6-column format: gene, chromosome, position, ref, alt, MAF, function score).

For the SFS files used here, the variant information was downloaded from gnomAD website. Non-Finnish European (NFE) allele frequencies are used in the minor allele frequency (MAF) column. The value for the "function score" column is based on the functional annotations from gnomAD which determines the type of variant, i.e. synonymous, missense. Each SFS file contains variants from each simulated gene respectively.

Global Parameter Setting¶

[global]
# Disease model scenario
parameter: name = 'quant_Prop100'
# proportion of functional variants that contribute to the disease
parameter: proportion = 'None'
# Mean shift value of detrimental rare variants
parameter: mean_shift = 1.0
# model: LOGIT for qualitative traits or LNR for quantitative traits
parameter: model = 'LNR'
# Path to the ped file (6-column PED format),quantitative trait values are standardized
parameter: ped_file = path('data/100extend_quant.ped')
# Path to list of genes
parameter: gene_list = path('data/genes.txt')
# the output directory for VCF file
parameter: out_dir = path('output')

# gene names
genes = paths([f'{gene_list:d}/{x.strip()}.sfs' for x in open(gene_list).readlines()])

Configuration file for disease model¶

First, we will need to generate the configuration file for the disease model which will be used in the simulation software RarePedSim.

Specify various simulation scenarios through the configuration file¶

The configuration file serves as the primary input for RarePedSim to generate simulated genotypes for families with QT values. For QTs, the linear mean-shift ("LNR") model is used to model the effect of causal rare variants on the distribution of QT values. The input PED file contains a column of QT values. These QT values were generated by sampling from a $N(2,1)$ distribution for a subset of family members and from a $N(0,1)$ distribution for the remaining family members, then standardized (mean = 0, standard deviation = 1) and saved in the last column of the PED file. Given the QT values in the pedigree(s) and MAF data obtained from gnomAD, genotypes can be simulated. For example:

When genotypes are simulated under the null, rare variants are not associated with QT values, i.e., meanshift_rare_detrimental=0.0. Genotype are generated using MAF obtained from gnomAD assuming Hardy-Weinberg equilibrium and Mendelian segregation, regardless of QT values.
When genotypes are simulated under the alternative, rare variants are assumed to cause an increase in QT values, i.e., meanshift_rare_detrimental is set to >0. For this example, we set meanshift_rare_detrimental=1 so that rare variant within a gene region increase with the QT value. The distribution of rare variants is $N(\sigma M, \sigma^2)$ where $M$ is total count of detrimental rare variants within a gene. Conditional on pre-specified QT values, the distribution of $M$, denoted as $p(M)$ are estimated and the genotypes are assigned based on Mendelian segregation. For formal derivation of this model please refer to Section 3.3 of RarePedSim Documentation.
When it is desired to simulate different proportions of rare variants being causal (rare variants that increase QT values) proportion_causal can be set be set <1, e.g., proportion_causal=0.5. When all variants are causal we set proportion_causal=1.

Please refer to the Appendix of RarePedSim Documentation for more details on other parameters used in the configuration file.

[make_config: provides = f'{out_dir}/{name}.conf']
# conf file contains the simulation specifications (either Mendelian or Complex, details in RarePedSim doc)
output: f'{out_dir}/{name}.conf'
report: expand=True, output=_output
    trait_type=Complex
    [model]
    model={model}
    [quality control]
    def_rare=0.01
    rare_only=True
    def_neutral=(-1E-5, 1E-5)
    def_protective=(-1, -1E-5)
    [phenotype parameters]
    baseline_effect=0.01
    moi=MAV
    proportion_causal={proportion}
    [LOGIT model]
    OR_rare_detrimental=None
    OR_rare_protective=None
    ORmax_rare_detrimental=None
    ORmin_rare_protective=None
    OR_common_detrimental=None
    OR_common_protective=None
    [LNR model]
    meanshift_rare_detrimental={mean_shift}
    meanshift_rare_protective=None
    meanshiftmax_rare_detrimental=None
    meanshiftmax_rare_protective=None
    meanshift_common_detrimental=None
    meanshift_common_protective=None
    [genotyping artifact]
    missing_low_maf=None
    missing_sites=None
    missing_calls=None
    error_calls=None
    [other]
    max_vars=2
    ascertainment_qualitative=(0,0)
    ascertainment_quantitative=((0,~),(0,~))

Generating Genotypes for Families¶

RarePedSim is used to generate genotypes for families with QT data based on user-specified model in the configuration file.

The output is a VCF file and tabix tool is used to build index in preparation for input to the next step.

[default_1 (RarePedSim simulation of pedigree genotype data)]
depends: f'{out_dir}/{name}.conf'
input: for_each = 'genes'
output: f'{out_dir}/{_genes:bn}.vcf.gz'
bash: container = 'statisticalgenetics/rarepedsim', expand = '${ }'
    rm -rf ${_output:nn} ${_output} ${_output}.tbi && mkdir -p ${_output:nn}
    rarepedsim generate -s ${_genes:a} -c ${out_dir}/${name}.conf -p ${ped_file:a} --num_genes 1 --num_reps 1 -o ${_output:nn} --vcf -b -1 \
    && mv ${_output:nn}/${_genes:bn}/rep1.vcf ${_output:n} && rm -rf ${_output:nn}
    bgzip ${_output:n} && tabix -p vcf ${_output}

Construction of CHP Regional Markers using RV-NPL¶

RV-NPL determines the CHP regional markers using genotypes from the families. Variants from either simulation studies or from sequence data can be used as input data. Here we are using simulated variants from step default_1. Rare variants with MAF below 0.01 are analyzed to construct the CHP regional markers by setting the parameter -c 0.01. The MAF information column is specified at --freq EVSMAF, since only variants with a MAF<0.01 will be analyzed

[default_2 (Computing CHP markers)]
output: f'{_input:nn}/MERLIN/{_input:bnn}.CHP.ped'
bash: container = 'statisticalgenetics/rvnpl', environment={'HOME': '/seqlink'}, expand = '${ }', stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    rvnpl collapse --fam ${ped_file} --vcf ${_input} --output ${_input:nn} --freq EVSMAF -c 0.01 --rvhaplo \
    && mv ${_output:nn}.chr*.ped ${_output}

Performing RV-QNPL Analysis on QT Values and CHP Regional Markers¶

RV-QNPL is used to analyze the CHP genotypes and QT values. Unlike NPL, for RV-NPL only sharing of the minor allele is evaluated.

[default_3 (NPL analysis of rare variants for quantitative traits)]
output: f'{_input:dd}/pvalue.txt'
bash: container = 'statisticalgenetics/rvnpl', environment={'HOME': '/seqlink'}, expand = '${ }', stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    rvnpl qnpl --path ${_input:dd} --output ${_output:d} --exact --rvibd --n_jobs 8 -c 0.001 --lower_cut 1E-8 --rep 2000000

Results¶

The p-values for two QNPL scores are presented in file pvalue.txt in the corresponding gene folder under the output directory.

Additional Notes for Analysis of Sequence data and Age-at-Onset of Alzheimer's Disease¶

For RV-QNPL paper we performed analysis of whole genome sequence data obtained on late-onset Alzheimer’s Disease (LOAD) families from the Alzheimer’s Disease Sequencing Project (ADSP). The workflow is very similar to the analysis of simulated data steps default_2 and default_3 -- you may swap the simulated VCF files with real data in VCF format and perform the analysis. Some details to note:

Quality Control (QC): QC can be the most important step when analyzing sequence data. For the LOAD families, initial QC was performed by the ADSP QC working group and was followed by additional QC at the variant-, genotype-, and family-level.
Annotation: The VCF files from ADSP were annotated with ANNOVAR and the MAF information annotated from gnomAD database using population-specific MAFs. These MAF were used to determine which variants are analyzed, i.e. rare and also to reconstruct haplotypes when there were founders with missing data to determine identical by descent (IBD) sharing. Since the LOAD ADSP families are of European and Caribbean Hispanic ancestry, allele frequencies from NFE and Admixed Americans (AMR), respectively were used.
Phenotype processing: For family members with LOAD, age-at-onset (AAO) was used as the QT. Before analysis AAO was first adjusted for sex and then standardized before analysis.
Analysis: RV-QNPL was used to analyze genes with >1 rare variant site. Frameshift, missense, nonsense and splice site variants with population-specific MAFs <0.01 were analyzed. For missing genotypes, CHP regional markers were constructed using gnomAD allele frequencies that corresponded to the family’s ancestry, i.e. NFE or AMR, by providing the specific MAF information name to parameter “--freq” in RV-QNPL. Joint and individual analyses were performed for the European and Caribbean Hispanic families.