SEQSpark Manual

Zhang Di & Suzanne M. Leal

Contents

1. Getting started
1.1. Prerequisites
1.2. Installation
1.3. How to run SEQSpark
2. Tutorial
2.1. Preparation
2.2. Basic quality control
2.3. Single variant association testing
2.4. Rare variant aggregate tests
2.5. Running SEQSpark and collecting the results
3. Reference of configurations
3.1. root
3.2. input
3.3. qualityControl
3.4. association
3.4.1. trait
3.4.2. method
3.5. meta-analysis
Appendices
A.1. Setting-up Spark on a workstation
A.2. Setting-up Spark on a cluster

1. Getting started

SEQSpark was developed to analyze large-scale genotype data consisting of many samples and large numbers of variants generated by Whole-Genome Sequencing (WGS) or Exome Sequencing (ES) projects. SEQSpark can also analyze imputed genotype data. This manual describes SEQSpark's functionality for annotation, data quality control, and association testing.

1.1. Prerequisites

The build tool sbt is needed to compile SEQSpark. If it is not available in your environment, you can download it from the sbt website.

Since SEQSpark is based on Spark, you need a computational environment that can run Spark. If you don't have a ready-to-use Spark server or cluster, you can follow the guides in the appendices to set one up. For more details, please refer to the official websites of Hadoop and Spark.

1.2. Installation

To obtain the source code:

 
dolores@cluster: ~$ git clone https://github.com/statgenetics/seqspark.git

To compile the source code and install:

 
dolores@cluster: seqspark$ ./install --prefix ~/software/seqspark --db-dir /user/dolores/seqspark

If you have never compiled SEQSpark before, it may take some time to download its dependencies. The install script not only compiles the source code but also downloads the RefSeq and dbSNP databases. Other prebuilt databases are fetched with the seqspark-db script. The CADD database is quite large (65 GB), so depending on the network connection it may take quite some time to download.
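The sub-command syntax of seqspark-db is not covered in this manual, so the invocation below is only a hypothetical starting point; consult the script's built-in help for the actual usage:

 
dolores@cluster: ~$ seqspark-db --help   # hypothetical flag; check the script's usage text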

1.3. How to run SEQSpark

In the configuration file, specify the paths of the input genotype file (VCF) and the phenotype file (TSV), together with the pipeline and the parameters you want to run. With the configuration file and all the necessary input files ready, you can perform your analysis as follows:

 
dolores@cluster: ~$ seqspark SingleStudy some.conf [spark-options]
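For example, assuming the seqspark launcher forwards standard spark-submit options, cluster resources can be requested like this (the values are only illustrative):

 
dolores@cluster: ~$ seqspark SingleStudy some.conf --master yarn --num-executors 16 --executor-memory 4G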

In the following chapters, we will explain the parameters that can be set in the configuration file and how to use them to perform data quality control, annotation, and association testing.

2. Tutorial

In this tutorial, we will guide you in the analysis of a simulated dataset of 2000 exomes. It will cover a large range of (but not all) parameters that can be controlled in the configuration file. For a detailed description, please refer to Chapter 3.

2.1. Preparation

After SEQSpark is installed as described in section 1.2, download the dataset and upload the data into the Hadoop file system (HDFS) for analysis using the put command:

 
dolores@cluster: demo$ wget http://seqspark.statgen.us/simulated.vcf.bz2
dolores@cluster: demo$ hdfs dfs -put simulated.vcf.bz2
dolores@cluster: demo$ wget http://seqspark.statgen.us/simulated.tsv
dolores@cluster: demo$ hdfs dfs -put simulated.tsv

2.2. Basic quality control

Modify the configuration file to specify the desired QC pipeline, adding more parameters as the QC proceeds.

Filter genotypes based on DP and GQ scores

First, create a configuration file:

 
dolores@cluster: demo$ vim demo.conf

In demo.conf, type or paste the following contents:

 
seqspark {
  project = demo
  pipeline = [ "qualityControl" ]
  input {
    genotype.path = "simulated.vcf.bz2"
    phenotype.path = "simulated.tsv"
  }
  qualityControl {
    genotypes = ["DP >= 8 and GQ >= 20"]
  }
}

Below is an explanation of each of the file's parameters:

seqspark {} is mandatory and anything outside the curly brackets will be ignored.

project is the name of the run; it is used as a suffix when creating cached datasets in HDFS.

pipeline is an array of the procedures to be performed. Here we will only perform the qualityControl procedure, whose parameters are specified below.

input is the configuration for the genotype and phenotype data. You need to specify the path for the location of these files. By default the files are located within HDFS, i.e. the path refers to a location in HDFS. Usually large files, e.g. genotypes, are stored in HDFS for the analysis. You can also specify a local path, for example for the phenotype file, as follows: "file:///home/dolores/demo/simulated.tsv". Note that only smaller files such as the phenotype file should be stored locally on the server/cluster.

qualityControl contains the parameters for the quality control procedures. In addition to the block form qualityControl {genotypes = ["DP >= 8 and GQ >= 20"]}, the same setting can be written with dot notation as qualityControl.genotypes = ["DP >= 8 and GQ >= 20"]. Additionally, instead of joining conditions with "and", the filters can simply be listed one per element: genotypes = ["DP >= 8", "GQ >= 20"].
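For reference, the three equivalent forms side by side:

 
# equivalent ways to specify the genotype filters (HOCON syntax)
qualityControl { genotypes = ["DP >= 8 and GQ >= 20"] }   # block form
qualityControl.genotypes = ["DP >= 8 and GQ >= 20"]       # dot notation
qualityControl.genotypes = ["DP >= 8", "GQ >= 20"]        # one condition per element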

Filter variants

Modify the configuration file as follows:

 
seqspark {
  project = demo
  pipeline = [ "qualityControl" ]
  input {
    genotype.path = "simulated.vcf.bz2"
    phenotype.path = "simulated.tsv"
  }
  qualityControl {
    genotypes = ["DP >= 8 and GQ >= 20"]
    variants = ["missingRate <= 0.1"]
  }
}

The only difference compared to the configuration above is that an additional filter was added to remove variants missing more than 10% of their genotype data. Other filters, such as hwePvalue, can also be used here.
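For example, a Hardy-Weinberg equilibrium filter can be added alongside the missingness filter (the 1e-5 threshold is only an illustration):

 
qualityControl {
  genotypes = ["DP >= 8 and GQ >= 20"]
  variants = ["missingRate <= 0.1", "hwePvalue >= 1e-5"]
}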

Generate summaries for genotype data

Modify the configuration file as follows:

 
seqspark {
  project = demo
  pipeline = [ "qualityControl" ]
  input {
    genotype.path = "simulated.vcf.bz2"
    phenotype.path = "simulated.tsv"
  }
  qualityControl {
    genotypes = ["DP >= 8 and GQ >= 20"]
    variants = ["missingRate <= 0.1"]
    summaries = ["pca", "titv"]
  }
}

This will perform Principal Component Analysis (PCA) of the genotype data and calculate Ti/Tv ratios for the variants. By default, variants with a minor allele frequency (MAF) > 0.01 are used and 10 principal components are computed. Selecting variants with other frequencies, as well as pruning variants for intermarker linkage disequilibrium (LD), is explained later and sketched below. The principal components can be used for data quality control and can also be included to control for substructure and admixture when performing association testing. The results are stored in the output/pca.txt and output/titv.txt files.
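For instance, LD pruning can be enabled by turning off noprune (a sketch based on the pca options described in section 3.3; the pruning values shown are the defaults given there):

 
qualityControl {
  summaries = ["pca", "titv"]
  pca {
    noprune = false  # enable LD pruning
    prune {
      window = 50    # sliding window size
      step = 5       # step size
      r2 = 0.5       # r2 threshold
    }
  }
}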

2.3. Single variant association testing

To perform association testing, we need to add another section association to the configuration file:

 
seqspark {
  project = demo
  pipeline = [ "qualityControl", "association" ]
  input {
    genotype.path = "simulated.vcf.bz2"
    phenotype.path = "simulated.tsv"
  }
  qualityControl {
    genotypes = ["DP >= 8 and GQ >= 20"]
    variants = ["missingRate <= 0.1"]
    summaries = ["pca", "titv"]
  }
  association {
    trait {
      list = ["bmi", "disease"]
      bmi {
        binary = false
        covariates = ["sex", "age", "disease"]
        pc = 0
      }
      disease {
        binary = true
        covariates = ["sex", "age", "bmi"]
        pc = 0
      }
    }
    method {
      list = [snv]
    }
  }
}

Although it looks lengthy at first, the structure is clear and simple. First, the pipeline list is extended to include association. The added association section consists of two subsections, trait and method.

trait describes the phenotypes to be analyzed and the corresponding covariates to include in the regression model. You need to specify whether the trait is binary (false denotes a quantitative trait) and how many principal components to include in the regression analysis. Here a binary trait, disease, and a quantitative trait, BMI, are analyzed.

method describes the methods to use. For single variant association analysis, snv denotes the type of analysis to perform; currently the additive test is supported.
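The method's parameters can also be overridden; for example, the Wald test can be requested instead of the default score test (a sketch based on the options in section 3.4.2):

 
method {
  list = ["snv"]
  snv {
    test = wald   # the default is the score test
  }
}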

2.4. Rare variant aggregate tests

To perform rare variant aggregate tests, the methods are added to the list as shown below:

 
seqspark {
  project = demo
  pipeline = [ "qualityControl", "association" ]
  input {
    genotype.path = "simulated.vcf.bz2"
    phenotype.path = "simulated.tsv"
  }
  qualityControl {
    genotypes = ["DP >= 8 and GQ >= 20"]
    variants = ["missingRate <= 0.1"]
    summaries = ["pca", "titv"]
  }
  association {
    trait {
      list = ["bmi", "disease"]
      bmi {
        binary = false
        covariates = ["sex", "age", "disease"]
        pc = 0
      }
      disease {
        binary = true
        covariates = ["sex", "age", "bmi"]
        pc = 0
      }
    }
    method {
      list = ["snv", "cmc", "brv", "vt", "skat", "skato"]
    }
  }
}

In the list, besides snv, five rare variant aggregate tests are now included: cmc, the Combined Multivariate and Collapsing method; brv, Burden of Rare Variants; vt, Variable Threshold; skat, the Sequence Kernel Association Test; and skato, SKAT-optimal. Note that for rare variant analysis, annotation using RefSeq is performed before the analysis. By default, all splice site, frameshift, missense, and nonsense variants of each gene are analyzed. It is possible to annotate using additional databases such as CADD and to specify which variants to analyze. The user can also specify the MAF cut-off used in the analysis, as sketched below; the specifics are covered later in the documentation.
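For example, the MAF cut-off for the CMC test could be raised to 5% (a sketch based on the burden test options described in section 3.4.2; the 5% value is only illustrative):

 
method {
  list = ["cmc"]
  cmc {
    test = score
    type = cmc
    maf {
      source = "pooled"
      cutoff = 0.05   # include variants with MAF below 5%
    }
  }
}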

2.5. Running SEQSpark and collecting the results

Now that the configuration file is ready, it can be run with SEQSpark:

 
dolores@cluster: demo$ seqspark SingleStudy demo.conf

In the current directory, an output directory will be created, in which you can find:

titv.txt summarizes the Ti/Tv ratio for each sample and for the whole dataset.

pca.txt contains the values of the 10 principal components for each sample.

A series of assoc_$trait_$method files contain the results of the association analysis for each trait and method combination.
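For example, for the bmi trait and the snv method used in this tutorial, the association results are in output/assoc_bmi_snv (assuming the naming pattern above) and can be inspected directly:

 
dolores@cluster: demo$ ls output
dolores@cluster: demo$ head output/assoc_bmi_snv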

3. Reference of configurations

3.1. root

 
seqspark {
    project = seqspark
    pipeline = [ "qualityControl", "association" ]
    partitions = 1024
}

project: The name of your project

pipeline: The pipeline to run

partitions: The number of Spark partitions to use.

3.2. input

 
seqspark {
  input {
    genotype {
      format = vcf
      path = ${seqspark.project}.vcf 
      filters = [] 
      variants = all 
    }
    phenotype {
      path = ${seqspark.project}.tsv
      batch = none
    }
  }
}

genotype.format: The format of the input genotype file, can be either vcf or impute2. When impute2 data is used, the analysis will be based on dosages for both single variant and rare aggregate association analysis.

genotype.path: The path in HDFS for the genotype data file.

genotype.filters: Filters applied to variants when loading the genotype data. The default empty list applies no filter.

genotype.variants: Load variants by regions. Valid values are

  1. "all": default, load all variants.
  2. "exome": only load coding variants. Very useful for rare variant aggregate tests using whole-genome data.
  3. comma separated regions: only load variants within these regions, e.g. "chr1" or "chr1:1-30000000,chr2".
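For example, to load only the first 30 Mbp of chromosome 1 plus all of chromosome 2:

 
input.genotype.variants = "chr1:1-30000000,chr2"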

phenotype.path: The path on HDFS or local filesystem for the phenotype data file.

phenotype.batch: Specifies the batch assignment when the genetic data was generated in batches. Some quality control can be performed by batch to investigate and/or correct for batch-specific artifacts or effects. The default is none, meaning no batch information is available.

3.3. qualityControl

 
seqspark {
  qualityControl {
    genotypes = []
    variants = []
    summaries = []
    pca {
      variants = [
          "informative",
          "missingRate <= 0.1",
          "chr != \"X\"",       
          "chr != \"Y\"",       
          "maf >= 0.01 and maf <= 0.99",
          "hwePvalue >= 13-5"
      ]
      noprune = true
      prune {
        window = 50
        step = 5
        r2 = 0.5
      }
    }
    save = false
    export = false
  }
}

genotypes: The filters for genotypes.

variants: The filters for variants.

summaries: functions to run on the genotype data, which include pca and titv.

pca: how to prepare the data for PCA.

save: save the cleaned dataset to HDFS in the internal format.

export: export the cleaned dataset to HDFS in VCF format.

3.4. association

The association section contains two subsections, trait and method.

 
seqspark {
  association {
    trait {
      list = [someTrait]
      someTrait {
        key1 = value1
        ...
      }
    }
    method {
      list = [someMethod]
      someMethod {
        key1 = value1
        ...
      }
    }
  }
}

3.4.1. trait

trait contains a list of phenotypes to analyze and the parameters for each phenotype.

 
someTrait {
  binary = false #true
  covariates = ["c1", "c2", ...]
  pc = 0
}

binary: true for binary trait and false for quantitative trait

covariates: a list of covariates in the phenotype file to be included in the regression model

pc: number of principal components to be included in the regression model

3.4.2. method

method contains a list of methods to use and the parameters for each method.

Single variant test

 
snv {
  test = score #wald
  type = snv
  resampling = false
  maf {
    source = "pooled"
    cutoff = 0.01
  }
}

For single variant association analysis, the following options can be used:

test: Can either perform the score or Wald test

type: must be snv for single variant test

resampling: generate empirical p-values when true, or analytical p-values when false

maf.source: which source to use to calculate the minor allele frequency. This can be from a database such as ExAC or gnomAD

maf.cutoff: only analyze variants with a MAF above the cutoff, since single variant association tests have very low power for rare variants.

Burden tests

 
burden {
  test = score #wald
  type = cmc #brv
  resampling = false
  weight = none #wss
  maf = {
    source = "pooled"
    cutoff = 0.01
    fixed = true
  }
  misc {
    varLimit = [ 2, 2000 ]
    groupBy = ["gene"]
    variants = ["isFunctional"]
    weightBeta = [1, 25]
  }
}

The resampling and maf.source options are the same as for the single variant analysis. The maf.cutoff here gives the frequency below which variants are included in the analysis, since unlike the snv analysis it is the rare variants that are of interest.

type: the coding style, either cmc or brv.

weight: the variant weighting scheme (none or wss). Weights can be used for all tests except CMC.

maf.fixed: when true, a fixed MAF cutoff is used to perform CMC or BRV. To perform the variable threshold (VT) test instead, set maf.fixed to false; the coding for VT can be either cmc or brv, and resampling then defaults to true.

misc: more options for rare variant tests
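For example, a variable threshold test using BRV coding might be configured as follows (a sketch based on the options just described):

 
vt {
  test = score
  type = brv        # the coding can be cmc or brv
  resampling = true # the default for VT
  maf {
    source = "pooled"
    cutoff = 0.01
    fixed = false   # variable threshold instead of a fixed cutoff
  }
}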

SKAT/SKAT-O tests

 
skato {
  type = skato #skat
  weight = skat
  maf {
    source = pooled
    cutoff = 0.01
  }
  misc {
    varLimit = 2000
    groupBy = ["gene"]
    variants = ["isFunctional"]
    method = "optimal.adj"
    rhos = []
    kernel = "linear.weighted"
    weightBeta = [1, 25]
  }
}

type can be skat or skato. weight and maf are similar to the burden tests, except that maf.fixed is set to false and test has no effect. In misc, the method, rhos, kernel, and weightBeta options are similar to their counterparts in the original SKAT R package. For more detail, please see the R package documentation.

Summary statistics for meta-analysis

 
sum {
  type = sum
  maf.source = pooled
  misc {
    varLimit = 2000
    groupBy = ["slidingWindow", 1000000, 0]
    variants = []
  }
}

Set type = sum to generate the score statistics and the covariance matrix. Other parameters are similar to those of the association analysis methods above. The default groupBy = ["slidingWindow", 1000000, 0] generates summary statistics for the variants in every 1 Mbp region. The summary statistics can later be used to perform single variant or rare variant aggregate meta-analysis.

3.5. meta-analysis

In addition to performing association analysis for a single study, SEQSpark can perform meta-analysis using summary statistics generated from single study analyses.

 
seqspark {
  project = someName
  partitions = 1024
  meta {
    studies = ["study1", "study2"]
    study1 {
      format = "seqspark"
      path = "hdfs:///user/dolores/data/study1.meta"
    }
    study2 {
      format = "seqspark"
      path = "hdfs://user/dolores/data/study2.meta"
    }
    method {
      list = ["methodName"]
      methodName {
        key1 = value1
        ...
      }
    }
  }
}

project: the project name, the same as in a single study analysis

partitions: the number of Spark partitions

meta: parameters controlling how the meta-analysis is performed

To run the meta-analysis, use the following command:

 
dolores@cluster: demo$ seqspark MetaAnalysis meta.conf

Appendices

A.1. Setting-up Spark on a workstation

0. preparation

Suppose the workstation has six spare hard drives, 16 CPU cores, and 64 GB of memory. The following is observed after formatting and mounting the hard drives separately:

 
dolores@workstation: ~$ df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/sda1      ext4      459G  114G  345G  25% /mnt/hadoop/disk1
/dev/sdb1      ext4      459G  113G  346G  25% /mnt/hadoop/disk2
/dev/sdc1      ext4      459G  113G  347G  25% /mnt/hadoop/disk3
/dev/sdd1      ext4      459G  112G  347G  25% /mnt/hadoop/disk4
/dev/sde1      ext4      459G  114G  346G  25% /mnt/hadoop/disk5
/dev/sdf1      ext4      459G  113G  346G  25% /mnt/hadoop/disk6
...

Make sure you can ssh to the localhost without typing the password. If you cannot, set it up as follows:

 
dolores@workstation: ~$ ssh-keygen
dolores@workstation: ~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
dolores@workstation: ~$ chmod 600 ~/.ssh/authorized_keys
dolores@workstation: ~$ ssh localhost
dolores@workstation: ~$ exit

Download the version of the JDK suitable for your system from Oracle's website, and then install it:

 
dolores@workstation: ~$ tar xf jdk-8u121-linux-x64.tar.gz
dolores@workstation: ~$ sudo cp -r jdk1.8.0_121 /opt/jdk1.8.0_121

1. install and set up Hadoop

Download and install:

 
dolores@workstation: ~$ wget http://mirror.olnevhost.net/pub/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
dolores@workstation: ~$ tar xf hadoop-2.6.5.tar.gz
dolores@workstation: ~$ sudo mv hadoop-2.6.5 /opt

Configure Hadoop:

 
dolores@workstation: ~$ cd /opt/hadoop-2.6.5/
dolores@workstation: hadoop-2.6.5$ mkdir data
dolores@workstation: hadoop-2.6.5$ cd etc/hadoop/

Edit the configuration files that are distributed with Hadoop. A sketch of a minimal single-node setup follows.
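A minimal single-node configuration might look like the following sketch. The port, directory, and replication values are assumptions to adapt to your system; the six data disks mounted above are used for HDFS storage, and JAVA_HOME should also be set in etc/hadoop/hadoop-env.sh (e.g. export JAVA_HOME=/opt/jdk1.8.0_121).

 
<!-- etc/hadoop/core-site.xml (sketch; the port is an assumption) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml (sketch) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop-2.6.5/data</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///mnt/hadoop/disk1,file:///mnt/hadoop/disk2,file:///mnt/hadoop/disk3,file:///mnt/hadoop/disk4,file:///mnt/hadoop/disk5,file:///mnt/hadoop/disk6</value>
  </property>
</configuration>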

Format HDFS and start the service:

 
dolores@workstation: hadoop$ cd ../..
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs namenode -format
dolores@workstation: hadoop-2.6.5$ ./sbin/start-dfs.sh

Check if it is working properly:

 
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs dfs -mkdir -p /user/dolores
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs dfs -put etc/hadoop/core-site.xml
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs dfs -cat core-site.xml
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs dfs -rm core-site.xml

Add the following to the file ~/.profile and re-login:

 
export PATH=/opt/hadoop-2.6.5/bin:$PATH

2. install and configure Spark

Download and install:

 
dolores@workstation: ~$ wget http://mirror.cogentco.com/pub/apache/spark/spark-2.1.0/spark-2.1.0-bin-without-hadoop.tgz
dolores@workstation: ~$ tar xf spark-2.1.0-bin-without-hadoop.tgz
dolores@workstation: ~$ sudo mv spark-2.1.0-bin-without-hadoop /opt/spark-2.1.0

Configure Spark:

 
dolores@workstation: ~$ cd /opt/spark-2.1.0/conf
dolores@workstation: conf$ cp spark-env.sh.template spark-env.sh
dolores@workstation: conf$ cp spark-defaults.conf.template spark-defaults.conf
dolores@workstation: conf$ cp log4j.properties.template log4j.properties

Edit the configuration files copied above. A sketch of a minimal setup follows.
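A minimal sketch for this single-node setup; the memory values are assumptions. Because this is a "without-hadoop" Spark build, SPARK_DIST_CLASSPATH must point Spark at the Hadoop jars.

 
# conf/spark-env.sh (sketch; memory values are assumptions)
export JAVA_HOME=/opt/jdk1.8.0_121
export SPARK_DIST_CLASSPATH=$(/opt/hadoop-2.6.5/bin/hadoop classpath)
export SPARK_WORKER_CORES=16      # all 16 cores of this workstation
export SPARK_WORKER_MEMORY=48g    # leave memory for the OS and HDFS

# conf/spark-defaults.conf (sketch)
spark.master             spark://localhost:7077
spark.executor.memory    4g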

Start Spark service:

 
dolores@workstation: conf$ cd /opt/spark-2.1.0
dolores@workstation: spark-2.1.0$ ./sbin/start-all.sh

Check if it is working properly:

 
dolores@workstation: spark-2.1.0$ ./bin/spark-shell

Add the following to the file ~/.profile and re-login:

 
export PATH=/opt/spark-2.1.0/bin:$PATH

A.2. Setting-up Spark on a cluster

0. preparation

Suppose the cluster to be set up consists of one master server, node0, and eight slave servers, node[1-8]. Each slave server has three spare hard drives, 16 CPU cores, and 64 GB of memory. Format and mount the hard drives separately. The following is observed on each slave server:

 
dolores@node1: ~$ df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/sda1      ext4      459G  114G  345G  25% /mnt/hadoop/disk1
/dev/sdb1      ext4      459G  113G  346G  25% /mnt/hadoop/disk2
/dev/sdc1      ext4      459G  113G  347G  25% /mnt/hadoop/disk3

Make sure you can ssh to any of the nodes without typing the password.

Download the version of the JDK suitable for your system from Oracle's website, and then install it on all nodes as follows:

 
dolores@node0: ~$ tar xf jdk-8u121-linux-x64.tar.gz
dolores@node0: ~$ sudo cp -r jdk1.8.0_121 /opt/jdk1.8.0_121

1. install and set up Hadoop

For all nodes, download and install:

 
dolores@node0: ~$ wget http://mirror.olnevhost.net/pub/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
dolores@node0: ~$ tar xf hadoop-2.6.5.tar.gz
dolores@node0: ~$ sudo mv hadoop-2.6.5 /opt

Configure Hadoop:

 
dolores@node0: ~$ cd /opt/hadoop-2.6.5/
dolores@node0: hadoop-2.6.5$ mkdir data
dolores@node0: hadoop-2.6.5$ cd etc/hadoop/

For all nodes, edit the configuration files that are distributed with Hadoop. A sketch follows.
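The configuration mirrors the workstation setup in A.1, except that fs.defaultFS points at the master node, hdfs-site.xml lists the three data disks of each slave, and the slaves file names the workers. The hostnames and the port below are assumptions.

 
<!-- etc/hadoop/core-site.xml on every node (sketch) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node0:9000</value>
  </property>
</configuration>

# etc/hadoop/slaves on node0: one worker hostname per line
node1
node2
node3
node4
node5
node6
node7
node8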

Format HDFS and start the service:

 
dolores@node0: hadoop$ cd ../..
dolores@node0: hadoop-2.6.5$ ./bin/hdfs namenode -format
dolores@node0: hadoop-2.6.5$ ./sbin/start-dfs.sh

Check if it is working properly:

 
dolores@node0: hadoop-2.6.5$ ./bin/hdfs dfs -mkdir -p /user/dolores
dolores@node0: hadoop-2.6.5$ ./bin/hdfs dfs -put etc/hadoop/core-site.xml
dolores@node0: hadoop-2.6.5$ ./bin/hdfs dfs -cat core-site.xml
dolores@node0: hadoop-2.6.5$ ./bin/hdfs dfs -rm core-site.xml

Add the following to the file ~/.profile and re-login:

 
export PATH=/opt/hadoop-2.6.5/bin:$PATH

2. install and configure Spark

For all nodes, download and install:

 
dolores@node0: ~$ wget http://mirror.cogentco.com/pub/apache/spark/spark-2.1.0/spark-2.1.0-bin-without-hadoop.tgz
dolores@node0: ~$ tar xf spark-2.1.0-bin-without-hadoop.tgz
dolores@node0: ~$ sudo mv spark-2.1.0-bin-without-hadoop /opt/spark-2.1.0

Configure Spark:

 
dolores@node0: ~$ cd /opt/spark-2.1.0/conf
dolores@node0: conf$ cp spark-env.sh.template spark-env.sh
dolores@node0: conf$ cp spark-defaults.conf.template spark-defaults.conf
dolores@node0: conf$ cp log4j.properties.template log4j.properties

For all nodes, edit the configuration files. A sketch follows.
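As in A.1, a sketch with assumed values; the master is designated via SPARK_MASTER_HOST and the workers are listed in conf/slaves.

 
# conf/spark-env.sh on every node (sketch)
export JAVA_HOME=/opt/jdk1.8.0_121
export SPARK_DIST_CLASSPATH=$(/opt/hadoop-2.6.5/bin/hadoop classpath)
export SPARK_MASTER_HOST=node0

# conf/slaves on node0: one worker hostname per line
node1
node2
node3
node4
node5
node6
node7
node8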

Start Spark service:

 
dolores@node0: conf$ cd /opt/spark-2.1.0
dolores@node0: spark-2.1.0$ ./sbin/start-all.sh

Check if it is working properly:

 
dolores@node0: spark-2.1.0$ ./bin/spark-shell

Add the following to the file ~/.profile and re-login:

 
export PATH=/opt/spark-2.1.0/bin:$PATH