Zhang Di & Suzanne M. Leal
SEQSpark Manual

1. Getting started
   1.1. Prerequisites
   1.2. Installation
   1.3. How to run SEQSpark
2. Tutorial
   2.1. Preparation
   2.2. Basic quality control
   2.3. Single variant association testing
   2.4. Rare variant aggregate tests
   2.5. Running SEQSpark and collecting the results
3. Reference of configurations
   3.1. root
   3.2. input
   3.3. qualityControl
   3.4. association
        3.4.1. trait
        3.4.2. method
   3.5. meta-analysis
Appendices
   A.1. Setting-up Spark on a workstation
   A.2. Setting-up Spark on a cluster
SEQSpark was developed to analyze large-scale genotype data consisting of many samples and a large number of variants generated by whole-genome sequencing (WGS) or exome sequencing (ES) projects. SEQSpark can also analyze imputed genotype data. This manual describes SEQSpark's functionality for annotation, data quality control and association testing.
The compile tool sbt is needed to build SEQSpark. If it is not available in your environment, you can download it here.
Since SEQSpark is based on Spark, you need a computational environment that can run Spark. If you do not have a ready-to-use Spark server/cluster, you can follow the guides in the appendices for installation. For more details, please refer to the official websites of Hadoop and Spark.
To obtain the source code:
dolores@cluster: ~$ git clone https://github.com/statgenetics/seqspark.git
To compile the source code and install:
dolores@cluster: seqspark$ ./install --prefix ~/software/seqspark --db-dir /user/dolores/seqspark
If you have never compiled SEQSpark before, it may take some time to download its dependencies. The install script not only compiles the source code, but also downloads the RefSeq and dbSNP databases. For other prebuilt databases, you need to use the script seqspark-db. The CADD database is quite large (65 GB), so depending on the network connection it may take quite some time to download.
In the configuration file, specify the paths of the input genotype file (VCF) and a phenotype file (TSV), and the pipeline with the parameters you want to run. With the configuration file and all the necessary input files ready, you can perform your analysis as follows:
dolores@cluster: ~$ seqspark SingleStudy some.conf [spark-options]
We will explain the parameters that can be set in the configuration file, and how to use them to perform data quality control, annotation and association tests in the following chapters.
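For example, on a workstation set up as in Appendix A.1, resource options can be passed on the command line. This is a hedged sketch that assumes the seqspark launcher forwards these standard spark-submit options unchanged; the values are illustrative:
dolores@cluster: ~$ seqspark SingleStudy some.conf --master spark://localhost:7077 --executor-memory 20G --driver-memory 4G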
In this tutorial, we will guide you through the analysis of a simulated dataset of 2,000 exomes. It covers a large range of (but not all) the parameters that can be controlled in the configuration file. For a detailed description, please refer to Chapter 3.
Make sure SEQSpark is installed as described in Section 1.2.
Download the dataset and upload the data into the Hadoop file system (HDFS) for analysis using the put command:
dolores@cluster: demo$ wget http://seqspark.statgen.us/simulated.vcf.bz2
dolores@cluster: demo$ hdfs dfs -put simulated.vcf.bz2
dolores@cluster: demo$ wget http://seqspark.statgen.us/simulated.tsv
dolores@cluster: demo$ hdfs dfs -put simulated.tsv
Modify the configuration file to specify the desired QC pipeline, adding more parameters as the QC is performed.
Filter genotypes based on DP and GQ score.
First, create a configuration file:
dolores@cluster: demo$ vim demo.conf
In the demo file, type or paste the following contents:
seqspark {
  project = demo
  pipeline = [ "qualityControl" ]
  input {
    genotype.path = "simulated.vcf.bz2"
    phenotype.path = "simulated.tsv"
  }
  qualityControl {
    genotypes = ["DP >= 8 and GQ >= 20"]
  }
}
Below is an explanation of each of the file's parameters:
seqspark {} is mandatory and anything outside the curly brackets will be ignored.
project is the name of the run; it will be used as a suffix when creating cached datasets in HDFS.
pipeline is an array of procedures to be performed. Here we only perform the qualityControl procedure, whose parameters are specified below.
input is the configuration for the genotype and phenotype data. You will need to specify the path to the location of these files. By default the files are assumed to be within HDFS, and the path refers to a location in HDFS. Usually large files, e.g. genotype files, are stored in HDFS for the analysis. You can also specify a local path, for example for the phenotype file: "file:///home/dolores/demo/simulated.tsv". Note that only smaller files such as the phenotype file should be stored locally on the server/cluster.
qualityControl contains the parameters for the quality control procedures. In addition to the expression qualityControl { genotypes = ["DP >= 8 and GQ >= 20"] }, the same setting can be written as qualityControl.genotypes = ["DP >= 8 and GQ >= 20"]. Additionally, instead of joining the filters with "and", they can simply be listed separately, as follows: genotypes = ["DP >= 8", "GQ >= 20"].
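For illustration, the same demo settings could be written with dotted keys and the filters listed separately (the values are those used above):
seqspark {
  project = demo
  pipeline = [ "qualityControl" ]
  input.genotype.path = "simulated.vcf.bz2"
  input.phenotype.path = "simulated.tsv"
  qualityControl.genotypes = ["DP >= 8", "GQ >= 20"]
}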
Filter variants
Modify the configuration file as follows:
seqspark {
  project = demo
  pipeline = [ "qualityControl" ]
  input {
    genotype.path = "simulated.vcf.bz2"
    phenotype.path = "simulated.tsv"
  }
  qualityControl {
    genotypes = ["DP >= 8 and GQ >= 20"]
    variants = ["missingRate <= 0.1"]
  }
}
The only difference compared to above is that an additional filter was added to remove variants missing more than 10% of their genotype data. Other filters, such as hwePvalue, can also be used here.
Generate summaries for genotype data
Modify the configuration file as follows:
seqspark {
  project = demo
  pipeline = [ "qualityControl" ]
  input {
    genotype.path = "simulated.vcf.bz2"
    phenotype.path = "simulated.tsv"
  }
  qualityControl {
    genotypes = ["DP >= 8 and GQ >= 20"]
    variants = ["missingRate <= 0.1"]
    summaries = ["pca", "titv"]
  }
}
This will perform Principal Component Analysis (PCA) of the genotype data and calculate Ti/Tv ratios for the variants. Here the default variant selection for PCA (minor allele frequency, MAF > 0.01) is used, and the default number of principal components is 10. Later it will be explained how to select variants with other frequencies as well as prune variants for inter-marker linkage disequilibrium (LD). The principal components can be used for data quality control and can also be loaded to control for substructure and admixture when performing association testing. The results are stored in the output/pca.txt and output/titv.txt files.
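As a preview of the options documented in Section 3.3, a hedged sketch of customizing the PCA input is shown below; the MAF window of 0.05 and the assumption that noprune = false turns LD pruning on are illustrative choices, not defaults:
qualityControl {
  genotypes = ["DP >= 8 and GQ >= 20"]
  variants = ["missingRate <= 0.1"]
  summaries = ["pca", "titv"]
  pca {
    variants = ["informative", "maf >= 0.05 and maf <= 0.95", "chr != \"X\"", "chr != \"Y\""]
    noprune = false   # assumed to enable LD pruning with the parameters below
    prune {
      window = 50
      step = 5
      r2 = 0.5
    }
  }
}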
To perform association testing, we need to add another section association to the configuration file:
seqspark {
  project = demo
  pipeline = [ "qualityControl", "association" ]
  input {
    genotype.path = "simulated.vcf.bz2"
    phenotype.path = "simulated.tsv"
  }
  qualityControl {
    genotypes = ["DP >= 8 and GQ >= 20"]
    variants = ["missingRate <= 0.1"]
    summaries = ["pca", "titv"]
  }
  association {
    trait {
      list = ["bmi", "disease"]
      bmi {
        binary = false
        covariates = ["sex", "age", "disease"]
        pc = 0
      }
      disease {
        binary = true
        covariates = ["sex", "age", "bmi"]
        pc = 0
      }
    }
    method {
      list = [snv]
    }
  }
}
Although it looks lengthy at first, the structure is clear and simple. First, the pipeline list is changed to include association. The added association section consists of two subsections, trait and method.
trait describes the phenotypes to be analyzed and the corresponding covariates to include in the regression model. You need to specify whether the trait is binary (false denotes that the trait is quantitative) and how many principal components to include in the regression analysis. Here a binary trait, disease, and a quantitative trait, BMI, are analyzed.
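For example, to also adjust for population substructure, the principal components computed during quality control can be added to the model; a hedged sketch in which pc = 2 is an arbitrary illustrative choice:
trait {
  list = ["bmi"]
  bmi {
    binary = false
    covariates = ["sex", "age"]
    pc = 2   # include the first two principal components as covariates
  }
}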
method describes the methods to use. For single variant association analysis, snv denotes the type of analysis to be performed. Currently the additive test is supported for single variant association analysis.
To perform rare variant aggregate tests, the methods are added to the list as shown below:
seqspark {
  project = demo
  pipeline = [ "qualityControl", "association" ]
  input {
    genotype.path = "simulated.vcf.bz2"
    phenotype.path = "simulated.tsv"
  }
  qualityControl {
    genotypes = ["DP >= 8 and GQ >= 20"]
    variants = ["missingRate <= 0.1"]
    summaries = ["pca", "titv"]
  }
  association {
    trait {
      list = ["bmi", "disease"]
      bmi {
        binary = false
        covariates = ["sex", "age", "disease"]
        pc = 0
      }
      disease {
        binary = true
        covariates = ["sex", "age", "bmi"]
        pc = 0
      }
    }
    method {
      list = ["snv", "cmc", "brv", "vt", "skat", "skato"]
    }
  }
}
Besides snv, the list now includes five rare variant aggregate tests: cmc, the Combined Multivariate Collapsing; brv, Burden of Rare Variants; vt, Variable Threshold; skat, Sequence Kernel Association Test; and skato, SKAT-optimal. It should be noted that for rare variant analysis, annotation using RefSeq is performed before the analysis. By default, all splice site, frameshift, missense and nonsense variants for each gene that pass the MAF cutoff are analyzed. It is possible to annotate using additional databases such as CADD and to specify which variants to analyze. The user can also specify the MAF cut-off to be used in the analysis. The specifics will be covered later in the documentation.
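As a preview of the options documented in Section 3.4.2, a hedged sketch of customizing one of the aggregate tests; the 0.5% MAF cutoff and the CADD filter are illustrative values, and the CADD filter assumes the CADD database has been downloaded with seqspark-db:
method {
  list = ["brv"]
  brv {
    maf {
      source = "pooled"
      cutoff = 0.005   # analyze only variants with MAF below 0.5%
    }
    misc {
      groupBy = ["gene"]
      variants = ["isFunctional", "CADD.score >= 3"]   # additionally require a CADD annotation
    }
  }
}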
Now that the configuration file is ready, it can be run with SEQSpark:
dolores@cluster demo$ seqspark SingleStudy demo.conf
In the current directory, an output directory will be created, where you can find the following files (see the commands after the list below for a quick way to inspect them):
titv.txt summarizing the Ti/Tv ratio for each sample and for the dataset.
pca.txt contains the values of the 10 principal components for each sample.
A series of assoc_$trait_$method files contain the results of the association analysis for each trait and method combination.
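For example, to take a quick look at the results (the file name below follows the assoc_$trait_$method pattern for the bmi trait and snv method used in this tutorial):
dolores@cluster: demo$ ls output/
dolores@cluster: demo$ head output/assoc_bmi_snv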
seqspark {
  project = seqspark
  pipeline = [ "qualityControl", "association" ]
  partitions = 1024
}
project: The name of your project
pipeline: The pipeline to run
partitions: The number of Spark partitions to use.
seqspark {
  input {
    genotype {
      format = vcf
      path = ${seqspark.project}.vcf
      filters = []
      variants = all
    }
    phenotype {
      path = ${seqspark.project}.tsv
      batch = none
    }
  }
}
genotype.format: The format of the input genotype file, either vcf or impute2. When impute2 data is used, the analysis will be based on dosages for both single variant and rare variant aggregate association analysis.
genotype.path: The path in HDFS for the genotype data file.
genotype.filters: Filters applied when loading the genotype data, for example:
filters = ["FILTER == \"PASS\"", "INFO.AN > 1800"]filters = ["INFO >= 0.8"]genotype.variants: Load variants by regions. Valid values are
"chr1","chr1:1-30000000,chr2" phenotype.path: The path on HDFS or local filesystem for the phenotype data file.
phenotype.batch: Specifies the batch when the genetic data were generated in batches. Some quality control can be performed by batch to investigate and/or correct for batch-specific artifacts or effects. The default is none, which means no batch information is available.
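A hedged example that combines these input options; the region list is illustrative, and batch is assumed to name a column in the phenotype file:
input {
  genotype {
    format = vcf
    path = "study.vcf.bz2"
    filters = ["FILTER == \"PASS\""]
    variants = "chr1:1-30000000,chr2"   # load only these regions
  }
  phenotype {
    path = "file:///home/dolores/study.tsv"
    batch = "sequencing_batch"          # assumed: a column in the phenotype file
  }
}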
seqspark {
  qualityControl {
    genotypes = []
    variants = []
    summaries = []
    pca {
      variants = [
        "informative",
        "missingRate <= 0.1",
        "chr != \"X\"",
        "chr != \"Y\"",
        "maf >= 0.01 and maf <= 0.99",
        "hwePvalue >= 1e-5"
      ]
      noprune = true
      prune {
        window = 50
        step = 5
        r2 = 0.5
      }
    }
    save = false
    export = false
  }
}
genotypes: The filters for genotypes.
variants: The filters for variants.
chr: chromosome
maf: alternative allele frequency
informative: not a monomorphic site
missingRate: overall missing rate
batchMissingRate: the maximum batch-specific missing rate
batchSpecific: the batch-specific variants cutoff
isFunctional: the variant site is functional, i.e. splice site, missense, frameshift or nonsense
hwePvalue: the P value of the HWE test
summaries: functions to perform on the genotype data, which include
pca: perform PCA analysis
titv: calculate the Ti/Tv ratio
pca: how to prepare data for PCA
pca.variants: variant inclusion criteria for PCA, for example excluding variants on the X and Y chromosomes
pca.noprune: do not prune SNPs based on LD
pca.prune: LD pruning parameters, similar to PLINK's pair-wise LD pruning
save: save the clean dataset to HDFS in internal format
export: export the clean dataset to HDFS in VCF format
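Putting these options together, a hedged sketch of a quality-control section that uses batch-aware filters and keeps the cleaned dataset (the cutoffs are illustrative):
qualityControl {
  genotypes = ["DP >= 8", "GQ >= 20"]
  variants = ["missingRate <= 0.1", "batchMissingRate <= 0.1", "hwePvalue >= 1e-5"]
  summaries = ["pca", "titv"]
  save = true     # keep the cleaned dataset in HDFS in the internal format
  export = false
}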
The association section contains two subsections, trait and method.
seqspark {
  association {
    trait {
      list = [someTrait]
      someTrait {
        key1 = value1
      }
    }
    method {
      list = [someMethod]
      someMethod {
        key1 = value1
      }
    }
  }
}
trait contains a list of phenotypes to analyze and the parameters for each phenotype.
someTrait {
  binary = false #true
  covariates = ["c1", "c2", ]
  pc = 0
}
binary: true for a binary trait and false for a quantitative trait
covariates: a list of covariates in the phenotype file to be included in the regression model
pc: number of principal components to be included in the regression model
method contains a list of methods to use and the parameters for each method.
Single variant test
snv {
  test = score #wald
  type = snv
  resampling = false
  maf {
    source = "pooled"
    cutoff = 0.01
  }
}
For single variant association analysis, the following parameters can be used:
test: Can either perform the score or Wald test
type: must be snv for single variant test
resampling: generate empirical p value when true or analytical p-values when false
maf.source: which source to use to calculate the minor allele frequency. This can be from a database such as ExAC or gnomAD
"pooled": using all samples"controls": using only controls, this is only valid when the trait is binary, otherwise pooled sample will be used. It is not recommened to use controls since it has been shown to increase type I error."some_key": using "some_key" to extract value from the INFO field. The value must be a number between 0 and 1. otherwise, pooled sample will be used.maf.cutoff: only perform analysis for variants with cutoff. Single variant association tests have very low power for rare variants.
Burden tests
burden {
  test = score #wald
  type = cmc #brv
  resampling = false
  weight = none #wss
  maf = {
    source = "pooled"
    cutoff = 0.01
    fixed = true
  }
  misc {
    varLimit = [ 2, 2000 ]
    groupBy = ["gene"]
    variants = ["isFunctional"]
    weightBeta = [1, 25]
  }
}
The resampling and maf.source options are the same as for the single variant analysis. The maf.cutoff specifies the frequency bound for the variants to be used in the analysis (MAF below the cutoff); unlike the snv analysis, here it is of interest to analyze rare variants.
type: the coding style.
cmc: Combined Multivariate Collapsing method - uses binary coding: (1) one or more rare variants are included in the region or (0) no rare variants
brv: Burden of Rare Variants - analyzes the number of rare variants within a region
weight: Variant weights can be used for all tests except for CMC
none: no weight, i.e. the weights are equal for all variants
wss: the weighted sum statistic (WSS) style weight, which up-weights rare variants based on their allele frequency
skat: the weighted style used by SKAT, w = Beta(maf; beta1, beta2), where beta1 and beta2 are 1 and 25 by default and can be specified in misc
"some_key": use "some_key" in the INFO field of the VCF
maf.fixed: use a fixed MAF cutoff to perform the cmc or brv test; to perform the variable threshold (VT) test, set maf.fixed to false. The coding used for VT can be either cmc or brv. For VT, resampling is set to true by default.
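For example, a hedged sketch of a variable threshold (VT) analysis with BRV coding; it assumes the vt entry accepts the same keys as the burden block, and the 5% upper bound is an illustrative choice:
vt {
  test = score
  type = brv
  weight = none
  maf {
    source = "pooled"
    cutoff = 0.05
    fixed = false   # variable threshold: vary the MAF cutoff instead of fixing it
  }
}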
misc: more options for rare variant tests
misc.varLimit: skip regions/genes with fewer variants than the lower bound or more than the upper bound. The default is varLimit = [ 2, 2000 ]
misc.groupBy: group variants by genes. Variants can also be grouped by ["slidingWindow", size, overlap] or by other regulatory regions, for example using annotation from ENCODE
misc.variants: only include variants that satisfy these filters. Annotations from various bioinformatic tools/databases can be used, e.g. ["CADD.score >= 3"]
misc.weightBeta: the beta1 and beta2 values for the skat-style weight
SKAT/SKAT-O tests
skato {
  type = skato #skat
  weight = skat
  maf {
    source = pooled
    cutoff = 0.01
  }
  misc {
    varLimit = 2000
    groupBy = ["gene"]
    variants = ["isFunctional"]
    method = "optimal.adj"
    rhos = []
    kernel = "linear.weighted"
    weightBeta = [1, 25]
  }
}
type can be skat or skato. weight and maf are similar to the burden tests, except that setting maf.fixed = false has no effect and the test option is ignored. In misc, the method, rhos, kernel, and weightBeta options are similar to their counterparts in the original SKAT R package. For more detail please see that package's documentation.
Summary statistics for meta-analysis
sum {
  type = sum
  maf.source = pooled
  misc {
    varLimit = 2000
    groupBy = ["slidingWindow", 1000000, 0]
    variants = []
  }
}
Set type = sum to generate the score statistics and the covariance matrix. Other parameters are similar to the association analysis methods above. By default, groupBy = ["slidingWindow", 1000000, 0] generates summary statistics for the variants in every 1 Mbp region. The summary statistics can later be used to perform single variant or rare variant aggregate meta-analysis.
In addition to performing association analysis for a single study, SEQSpark can perform meta-analysis using summary statistics generated from single study analyses.
seqspark {
  project = someName
  partitions = 1024
  meta {
    studies = ["study1", "study2"]
    study1 {
      format = "seqspark"
      path = "hdfs:///user/dolores/data/study1.meta"
    }
    study2 {
      format = "seqspark"
      path = "hdfs:///user/dolores/data/study2.meta"
    }
    method {
      list = ["methodName"]
      methodName {
        key1 = value1
      }
    }
  }
}
project: the project name, same as for the single study analysis
partitions: number of Spark partitions
meta: parameters controlling how to perform the meta-analysis
studies: a list of studies
someStudy: contains format and path of the summary statistics of the study
someStudy.format: currently only the results from SEQSpark single study analysis are supported, but this will be expanded to additional formats
someStudy.path: the location of the summary statistics
method: parameters controlling the methods to use
method.list: a list of methods to use, e.g. SKAT, BRV, etc.
method.methodName: the configuration for each named method, e.g. method.skat. The methods for meta-analysis are similar to the single study association analysis, with one exception: permutation-based analysis, e.g. VT, is not available, since the meta-analysis is based on summary statistics.
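A hedged sketch of the method subsection for a meta-analysis; the keys inside each block are assumed to mirror their single-study counterparts:
method {
  list = ["brv", "skat"]
  brv {
    weight = wss
  }
  skat {
    misc {
      weightBeta = [1, 25]
    }
  }
}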
To run the meta-analysis, use the command as follows:
dolores@cluster demo$ seqspark MetaAnalysis meta.conf
0. preparation
Suppose the workstation has six spare hard drives, 16 CPU cores, and 64 GB of memory. The following is observed after formatting and mounting the hard drives separately:
dolores@workstation: ~$ df -Th
Filesystem  Type  Size  Used  Avail  Use%  Mounted on
/dev/sda1   ext4  459G  114G  345G   25%   /mnt/hadoop/disk1
/dev/sdb1   ext4  459G  113G  346G   25%   /mnt/hadoop/disk2
/dev/sdc1   ext4  459G  113G  347G   25%   /mnt/hadoop/disk3
/dev/sdd1   ext4  459G  112G  347G   25%   /mnt/hadoop/disk4
/dev/sde1   ext4  459G  114G  346G   25%   /mnt/hadoop/disk5
/dev/sdf1   ext4  459G  113G  346G   25%   /mnt/hadoop/disk6
...
Make sure you can ssh to the localhost without typing the password. If you cannot, set it up as follows:
dolores@workstation: ~$ ssh-keygen
dolores@workstation: ~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
dolores@workstation: ~$ chmod 600 ~/.ssh/authorized_keys
dolores@workstation: ~$ ssh localhost
dolores@workstation: ~$ exit
Download the suitable version of the JDK for your system here, and then install it:
dolores@workstation: ~$ tar xf jdk-8u121-linux-x64.tar.gz
dolores@workstation: ~$ sudo cp -r jdk1.8.0_121 /opt/jdk1.8.0_121
1. install and set up Hadoop
Download and install:
dolores@workstation: ~$ wget http://mirror.olnevhost.net/pub/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
dolores@workstation: ~$ tar xf hadoop-2.6.5.tar.gz
dolores@workstation: ~$ sudo mv hadoop-2.6.5 /opt
Configure Hadoop:
dolores@workstation: ~$ cd /opt/hadoop-2.6.5/
dolores@workstation: hadoop-2.6.5$ mkdir data
dolores@workstation: hadoop-2.6.5$ cd etc/hadoop/
Edit the following files that are distributed with Hadoop as follows:
hadoop-env.sh:
change the line export JAVA_HOME=${JAVA_HOME} to export JAVA_HOME=/opt/jdk1.8.0_121
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop-2.6.5/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///mnt/hadoop/disk1,file:///mnt/hadoop/disk2,file:///mnt/hadoop/disk3,file:///mnt/hadoop/disk4,file:///mnt/hadoop/disk5,file:///mnt/hadoop/disk6</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>33554432</value>
  </property>
</configuration>
Format HDFS and start the service:
dolores@workstation: hadoop$ cd ../..
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs namenode -format
dolores@workstation: hadoop-2.6.5$ ./sbin/start-dfs.sh
Check if it is working properly:
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs dfs -mkdir -p /user/dolores
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs dfs -put etc/hadoop/core-site.xml
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs dfs -cat core-site.xml
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs dfs -rm core-site.xml
Add the following to the file ~/.profile and re-login:
export PATH=/opt/hadoop-2.6.5/bin:$PATH
2. install and configure Spark
Download and install:
dolores@workstation: ~$ wget http://mirror.cogentco.com/pub/apache/spark/spark-2.1.0/spark-2.1.0-bin-without-hadoop.tgz
dolores@workstation: ~$ tar xf spark-2.1.0-bin-without-hadoop.tgz
dolores@workstation: ~$ sudo mv spark-2.1.0-bin-without-hadoop /opt/spark-2.1.0
Configure Spark:
dolores@workstation: ~$ cd /opt/spark-2.1.0/conf
dolores@workstation: conf$ cp spark-env.sh.template spark-env.sh
dolores@workstation: conf$ cp spark-defaults.conf.template spark-defaults.conf
dolores@workstation: conf$ cp log4j.properties.template log4j.properties
Edit the following files that are distributed with Spark as follows:
spark-env.sh:
export HADOOP_HOME=/opt/hadoop-2.6.5
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${HADOOP_HOME}/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
source ${HADOOP_CONF_DIR}/hadoop-env.sh
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
export SPARK_WORKER_MEMORY=52g
export SPARK_WORKER_CORES=16
Note: set $SPARK_WORKER_MEMORY and $SPARK_WORKER_CORES to reasonable values for your server.
spark-defaults.conf:
spark.master                     spark://localhost:7077
spark.executor.memory            52g
spark.driver.memory              8g
spark.driver.maxResultSize       6g
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.rdd.compress               true
spark.kryoserializer.buffer.max  1024m
Note: set the memory usage options to reasonable values for your server.
log4j.properties:
add one line log4j.logger.org.apache.spark=WARN
Start Spark service:
dolores@workstation: conf$ cd /opt/spark-2.1.0
dolores@workstation: spark-2.1.0$ ./sbin/start-all.sh
Check if it is working properly:
dolores@workstation: spark-2.1.0$ ./bin/spark-shell
Add the following to the file ~/.profile and re-login:
export PATH=/opt/spark-2.1.0/bin:$PATH
0. preparation
Suppose the cluster to be set up consists of one master server, node0, and eight slave servers, node[1-8]. Each slave server has 3 spare hard drives, 16 CPU cores, and 64 GB of memory. Format and mount the hard drives separately. The following is observed on each slave server:
dolores@node1: ~$ df -Th
Filesystem  Type  Size  Used  Avail  Use%  Mounted on
/dev/sda1   ext4  459G  114G  345G   25%   /mnt/hadoop/disk1
/dev/sdb1   ext4  459G  113G  346G   25%   /mnt/hadoop/disk2
/dev/sdc1   ext4  459G  113G  347G   25%   /mnt/hadoop/disk3
Make sure you can ssh to any of the nodes without typing the password.
Download the suitable version of JDK for your system here, and then install it for all nodes as follows:
dolores@node0: ~$ tar xf jdk-8u121-linux-x64.tar.gz
dolores@node0: ~$ sudo cp -r jdk1.8.0_121 /opt/jdk1.8.0_121
1. install and set up Hadoop
For all nodes, download and install:
dolores@node0: ~$ wget http://mirror.olnevhost.net/pub/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
dolores@node0: ~$ tar xf hadoop-2.6.5.tar.gz
dolores@node0: ~$ sudo mv hadoop-2.6.5 /opt
Configure Hadoop:
dolores@node0: ~$ cd /opt/hadoop-2.6.5/
dolores@node0: hadoop-2.6.5$ mkdir data
dolores@node0: hadoop-2.6.5$ cd etc/hadoop/
For all nodes, edit the following files that are distributed with Hadoop as follows:
hadoop-env.sh:
change the line export JAVA_HOME=${JAVA_HOME} to export JAVA_HOME=/opt/jdk1.8.0_121
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node0:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
For node0, edit the following files:
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop-2.6.5/data/namenode</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>33554432</value>
  </property>
</configuration>
slaves:
node1
node2
node3
node4
node5
node6
node7
node8
For node[1-8], edit the following file:
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///mnt/hadoop/disk1,file:///mnt/hadoop/disk2,file:///mnt/hadoop/disk3</value>
  </property>
</configuration>
Format HDFS and start the service:
dolores@node0: hadoop$ cd ../..
dolores@node0: hadoop-2.6.5$ ./bin/hdfs namenode -format
dolores@node0: hadoop-2.6.5$ ./sbin/start-dfs.sh
Check if it is working properly:
dolores@node0: hadoop-2.6.5$ ./bin/hdfs dfs -mkdir -p /user/dolores
dolores@node0: hadoop-2.6.5$ ./bin/hdfs dfs -put etc/hadoop/core-site.xml
dolores@node0: hadoop-2.6.5$ ./bin/hdfs dfs -cat core-site.xml
dolores@node0: hadoop-2.6.5$ ./bin/hdfs dfs -rm core-site.xml
Add the following to the file ~/.profile and re-login:
export PATH=/opt/hadoop-2.6.5/bin:$PATH
2. install and configure Spark
For all nodes, download and install:
dolores@node0: ~$ wget http://mirror.cogentco.com/pub/apache/spark/spark-2.1.0/spark-2.1.0-bin-without-hadoop.tgz
dolores@node0: ~$ tar xf spark-2.1.0-bin-without-hadoop.tgz
dolores@node0: ~$ sudo mv spark-2.1.0-bin-without-hadoop /opt/spark-2.1.0
Configure Spark:
dolores@node0: ~$ cd /opt/spark-2.1.0/conf
dolores@node0: conf$ cp spark-env.sh.template spark-env.sh
dolores@node0: conf$ cp spark-defaults.conf.template spark-defaults.conf
dolores@node0: conf$ cp log4j.properties.template log4j.properties
For all nodes, edit the following files:
spark-env.sh:
export HADOOP_HOME=/opt/hadoop-2.6.5
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${HADOOP_HOME}/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
source ${HADOOP_CONF_DIR}/hadoop-env.sh
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
export SPARK_WORKER_CORES=16
export SPARK_WORKER_MEMORY=52g
Note: set $SPARK_WORKER_MEMORY to a reasonable value for your server.
spark-defaults.conf:
spark.master                     spark://node0:7077
spark.executor.memory            52g
spark.driver.memory              8g
spark.driver.maxResultSize       6g
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.rdd.compress               true
spark.kryoserializer.buffer.max  1024m
Note: set the memory usage options to reasonable values for your server.
For node0, edit the following files:
log4j.properties:
add one line log4j.logger.org.apache.spark=WARN
slaves:
node1
node2
node3
node4
node5
node6
node7
node8
Start Spark service:
dolores@node0: conf$ cd /opt/spark-2.1.0
dolores@node0: spark-2.1.0$ ./sbin/start-all.sh
Check if it is working properly:
dolores@node0: spark-2.1.0$ ./bin/spark-shell
Add the following to the file ~/.profile and re-login:
export PATH=/opt/spark-2.1.0/bin:$PATH