Rare Variant-Family-Based Score Test for Quantitative Traits

A regional association analysis of rare variants and quantitative traits in family data RV_FamSq <- function(ped_pheno, ped_geno, maf_data, maf_cutoff,parafile,covar_col, trait_col, pop_col=c(), out,kin, estimateAF,start_par)

RV_FamSq(ped_pheno, ped_geno, maf_data, maf_cutoff, covar_col, trait_col,
  pop_col = c(), out, kin, estimateAF, start_par)

Arguments

ped_pheno	a dataframe of phenotype. The first four columns should be family IDs, individual IDs, father IDs and mother IDs of each sample. The ped_pheno can also include columns of covariates and populations IDs. If more than one population groups are included in the analysis, the column of population IDs should indicate which population specific MAF will be used.
ped_geno	a dataframe of genotype. The first two columns are family IDs and individual IDs. The rest columns of the ped_geno are genotype score at each sites. The value of genotype score is 0, 1 or 2, corresponding to number of rare variants observed at each sites. Number of sites included in ped_geno should be equal to the number of rows of `maf_data`.
maf_data	a dataframe describing the information of rare variants included in ped_geno. The first three columns correspond to the name of the gene, chromosome number, and position of variants. The rest columns of maf_data are MAFs of variants in each specific population group obtained from database, e.g., Genome Aggregation Database (gnomAD).
maf_cutoff	the cutoff of MAF for rare variants
covar_col	a integer or a vector of integers indicating which columns of `ped_pheno` will be used as covariants of the analysis. Note that multiple covariants are allowed in the analysis.
trait_col	a integer indicating which columns of `ped_pheno` will be used as traits.
pop_col	a integer indicating which columns of `ped_pheno` represent the population IDs of the sample.
out	the directory that save the results. RVFamSq outputs a dataframe results including the name of the gene, score, p-value of the gene, number of sample size and number of families. In addtion, the values of parameters estimated under the null hyphothesis will be saved under this floder as well with the name "paras.rds"
kin	a kinship matrix calculated either based on family strtuctures or genomic informations.
estimateAF	a dataframe of estimated MAFs that do not oberserved in public datasets. The headers decribing the name of the sub-population should be consistent with the headers in `maf_data`. If this argument is not defined, RVfamSq will estimate the missing MAF based on the dosage of founder.
start_par	an optional argument that defines the starting values of the parameters fitting. User can define specific values to start the parameters fitting, otherwise, the starting values will generate randomly.

Details

RVFamSq performed association analysis by estimating the parameters under the null model and calculating the statistical score using these parameters. The parameters are estimated by maximizing the multivariate normal likelihood: $$L=\prod_{i}(2\pi)^{-n_i / 2}\left |\Omega _i\right |^{-1 /2}e^{\left [ y_i - E(y_i)\right ]'\Omega _i^{-1}\left [ y_i - E(y_i)\right ]}$$ where $n_i$ is the number of individuals in family i and $\left |\Omega _i\right |$ is the determinant of matrix $\Omega _i$. $E(y_i)$ in the above function is a vector of expected phenotype of all individuals within each family and is calculated by: $$E(y_i)=\mu + \beta _x x_i$$ where $x_i$ is a $n_i \times q$ matrix of q covariates included in the association analysis. The parameters $\mu$ and $\beta _x$ are population mean and a vector of covariate effects. The two parameters are estimated by maximizing the above likelihood function. The matrix $\Omega_i$ in the likelihood function is calculated within each family i and defined as: $$\Omega _{ijk}\left\{\begin{matrix}\sigma _g^2 + \sigma_e^2\ \ \ \ \ \ \ \ \ \ \ \ \ if\ j=k\\ 2\varphi _{ijk}\sigma _g^2\ \ \ \ \ \ \ \ \ \ \ \ if\ i\neq k \end{matrix}\right.$$ Here, the parameter $\sigma_g^2$ and$\sigma_e^2$ are variance components that are account for background polygenic effects and environmental effects, respectively. Additionally, $\Omega _{ijk}$ denote the kinship coefficient between individuals j and k. The values of parameters $\sigma_g^2$, $\sigma_e^2$, $\beta_x$, and $\mu$ are estimated by maximizing the above likelihood. With the values of these parameters, we can obtain the variance-covariance matrix $\Omega _i^{base}$ and vector $E(y_i)^{base}$ based on the above equations for each family. Using these two quantities, the statistic score is defined by: $$T^{SCORE}=\frac{\left \{ \sum _i \left [G_i -E(G_i)\right ]'[\Omega _i^{(base)}]^{-1}[y_i-E(y_i)^{(base)}] \right \}^2}{\sum _i[G_i-E(G_i)]'[\Omega_i^{(base)}]^{-1}[G_i-E(G_i)^{(base)}] }$$ where $E(G_i)$ is defined as: $$E(G_i)=\sum_{m=1}^{M}2p_m$$ where $p_m$ is the minor allele frequency (MAF) which is obtain from the dataframe of maf_data.

Data format

Example data is generated for 1,000 extended families with 10 members in each family. The information of RVs within functional region of the gene AAAS with MAF<2 Quantitative traits are generated randomly based on the disribution of N(0,1) and genotypes of samples are generated based on MAF of the gene AAAS.

ped_pheno.txt A phenotype file of 1,000 extended families (10,000 individuals in total). Each column of the file represents family IDs, individual IDs, father IDs, mother IDs, sex, traits, and population IDs, respectively. RVFamSq is capable to analyze multiple covariates and extra covariants should be also included in this file after the fourth column.
ped_geno.txt A genotype file of samples included in the ped_pheno.txt file. The first two columns of the file should be family IDs and individual IDs that have the same format as in ped_pheno.txt. The rest columns of the file are genotypes of each individual and codes as 0, 1, and 2.
AAAS.sfs A file with descriptive information of RVs within the functional region of the gene AAAS. The file contains four columns representing the name of the interested gene, chromosome, position, and MAF of the variant, respectively. If more than one sub-populations are included in the samples, multiple MAFs correspond to each specific population should be included in this file and denoted by headers of population specific symbols

References

Chen, W.-M., and Abecasis, G.R. (2007). Family-Based Association Tests for Genomewide Association Scans. Am. J. Hum. Genet. 81, 913–926.

Examples


library (kinship2)
#> Loading required package: Matrix
#> Loading required package: quadprog
library (RVFamSq)

##  load example data
data_dir<-system.file("extdata", package = "RVFamSq")
ped_pheno<-read.table(phenofile<-paste0(data_dir,"/ped_pheno.txt"))
ped_geno<-read.table(paste0(data_dir, "/ped_geno.txt"))
maf_data<-read.table(paste0(data_dir,"/AAAS.sfs"), header = TRUE)

##  Define output directory that save the results
out<-"results"

## Load phenotype data and calculate the kinship matrix of the pedigree. In this example, we utilize the R package to estimate the kinship of samples based on family structure, but the kinship can also be estimated by genetic variants.
kin_pre<-data.frame(id=ped_pheno[,2], mom=ped_pheno[,4], dad=ped_pheno[,3], sex=ped_pheno[,5])
tped <- with(kin_pre, pedigree(id, dad, mom, sex, famid=ped_pheno[,1]))
kin <- kinship(tped)

## Run RVFamsSq package and calculate the statistical score of the interested gene. If \code{parafile} is not existed, the below command will estimate the parameters using the available data.
## Once you obtained the values of parameters, you can 
## Results can be found under `results/` folder.
RV_FamSq(ped_pheno=ped_pheno, ped_geno=ped_geno, maf_data=maf_data, maf_cutoff =0.02, covar_col=c(5), trait_col=c(6), pop_col=c(7), out=out, kin=kin)
#> Loading required package: stats4
#> [1] "Creat results folder at results"
#> [1] "Starting values of parameters are not defined, generating random initial values now..."
#> [1] "Estimating the values of parameters miu,betax,sitag,sitae"
#>   gene    score         p sample_size family_size
#> 1 AAAS 0.331323 0.5648811       10000        1000
#> NULL