Zhang Di & Suzanne M. Leal
SEQSpark Manual

1. Getting started
1.1. Prerequisites
1.2. Installation
1.3. How to run SEQSpark
2. Tutorial
2.1. Preparation
2.2. Basic quality control
2.3. Single variant association testing
2.4. Rare variant aggregate tests
2.5. Running SEQSpark and collecting the results
3. Reference of configurations
3.1. root
3.2. input
3.3. qualityControl
3.4. association
3.4.1. trait
3.4.2. method
3.5. meta-analysis
Appendices
A.1. Setting-up Spark on a workstation
A.2. Setting-up Spark on a cluster
SEQSpark was developed to analyze large-scale genotype data consisting of many samples and a large number of variants generated from whole-genome sequencing (WGS) or exome sequencing (ES) projects. SEQSpark can also analyze imputed genotype data. This manual describes SEQSpark's functionality for performing annotation, data quality control and association testing.
The build tool sbt is needed to compile SEQSpark. If it is not available in your environment, you can download it here.
Since SEQSpark is based on Spark, you need a computational environment that can run Spark. If you don't have a ready-to-use Spark server/cluster, you can follow the guides in the appendices for installation. For more details, please refer to the official websites of Hadoop and Spark.
To obtain the source code:
dolores@cluster: ~$ git clone https://github.com/statgenetics/seqspark.git
To compile the source code and install:
dolores@cluster: seqspark$ ./install --prefix ~/software/seqspark --db-dir /user/dolores/seqspark
If you have never compiled SEQSpark
before, it may take some time to download its dependencies. The install
script not only compiles the source code, but also downloads the RefSeq and dbSNP databases. For other prebuilt databases, you need to use the script seqspark-db
. The CADD database is quite large (65GB), so depending on the network connection it may take quite some time to download.
In the configuration file, specify the paths of the input genotype file (VCF) and the phenotype file (TSV), and the pipeline and parameters you want to run. Once the configuration file and all the necessary input files are ready, you can perform your analysis as follows:
dolores@cluster: ~$ seqspark SingleStudy some.conf [spark-options]
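For example, assuming the trailing options are forwarded to Spark (spark-submit style), a run on a standalone cluster might look like this (the master URL and memory setting are placeholders for your own environment):
dolores@cluster: ~$ seqspark SingleStudy some.conf --master spark://node0:7077 --executor-memory 20G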
We will explain the parameters that can be set in the configuration file, and how to use them to perform data quality control, annotation and association tests in the following chapters.
In this tutorial, we will guide you in the analysis of a simulated dataset of 2000 exomes. It will cover a large range of (but not all) parameters that can be controlled in the configuration file. For a detailed description, please refer to Chapter 3.
After SEQSpark is installed as described in Section 1.2, download the dataset and upload the data into the Hadoop Distributed File System (HDFS) for analysis using the put command:
dolores@cluster: demo$ wget http://seqspark.statgen.us/simulated.vcf.bz2
dolores@cluster: demo$ hdfs dfs -put simulated.vcf.bz2
dolores@cluster: demo$ wget http://seqspark.statgen.us/simulated.tsv
dolores@cluster: demo$ hdfs dfs -put simulated.tsv
Modify the configuration file to specify the desired QC pipeline, adding more parameters as the QC is performed.
Filter genotypes based on DP and GQ score.
First, create a configuration file:
dolores@cluster: demo$ vim demo.conf
In the demo file, type or paste the following contents:
seqspark {
project = demo
pipeline = [ "qualityControl" ]
input {
genotype.path = "simulated.vcf.bz2"
phenotype.path = "simulated.tsv"
}
qualityControl {
genotypes = ["DP >= 8 and GQ >= 20"]
}
}
Below is an explanation of each of the file's parameters:
seqspark {} is mandatory, and anything outside the curly brackets will be ignored.
project
is the name of the run, and it will be used as a suffix when creating cached dataset in HDFS.
pipeline
is an array of procedures to be performed, here we will only perform the qualityControl
procedure, and its parameters are specified below.
input is the configuration for the genotype and phenotype data. You need to specify the path to each of these files. By default, paths refer to locations within HDFS. Usually large files, e.g. the genotypes, are stored in HDFS for the analysis. You can also specify a local path, for example for the phenotype file: "file:///home/dolores/demo/simulated.tsv". Note that only smaller files, such as the phenotype file, should be stored locally on the server/cluster.
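For example, the genotype file can stay in HDFS while the phenotype file is read from the local filesystem:
input {
  genotype.path = "simulated.vcf.bz2"                          # relative path, resolved within HDFS
  phenotype.path = "file:///home/dolores/demo/simulated.tsv"   # local file on the server
}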
qualityControl contains the parameters for the quality control procedures. In addition to the form qualityControl { genotypes = ["DP >= 8 and GQ >= 20"] }, the same setting can be written as qualityControl.genotypes = ["DP >= 8 and GQ >= 20"]. Additionally, instead of joining conditions with "and", the filters can simply be listed separately, as in genotypes = ["DP >= 8", "GQ >= 20"].
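For reference, here are the three equivalent forms side by side, taken directly from the examples above:
qualityControl {
  genotypes = ["DP >= 8 and GQ >= 20"]               # nested block, single expression
}
qualityControl.genotypes = ["DP >= 8 and GQ >= 20"]  # dotted-path form
qualityControl.genotypes = ["DP >= 8", "GQ >= 20"]   # separate filters, applied jointly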
Filter variants
Modify the configuration file as follows:
seqspark {
project = demo
pipeline = [ "qualityControl" ]
input {
genotype.path = "simulated.vcf.bz2"
phenotype.path = "simulated.tsv"
}
qualityControl {
genotypes = ["DP >= 8 and GQ >= 20"]
variants = ["missingRate <= 0.1"]
}
}
The only difference here compared to above is that an additional filter was added to remove variants missing more than 10% of their genotype data. Other filters, such as hwePvalue, can also be used here.
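For instance, a Hardy-Weinberg equilibrium filter could be added alongside the missing-rate filter; a sketch (the threshold of 1e-5 matches the default used for PCA variant selection in Chapter 3 and is only an example value):
qualityControl {
  genotypes = ["DP >= 8 and GQ >= 20"]
  variants = ["missingRate <= 0.1", "hwePvalue >= 1e-5"]
}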
Generate summaries for genotype data
Modify the configuration file as follows:
seqspark {
project = demo
pipeline = [ "qualityControl" ]
input {
genotype.path = "simulated.vcf.bz2"
phenotype.path = "simulated.tsv"
}
qualityControl {
genotypes = ["DP >= 8 and GQ >= 20"]
variants = ["missingRate <= 0.1"]
summaries = ["pca", "titv"]
}
}
This will perform Principal Component Analysis (PCA) of the genotype data and calculate Ti/Tv ratios for the variants. By default, variants with a minor allele frequency (MAF) > 0.01 are selected for the PCA and 10 principal components are computed. Later it will be explained how to select variants with other frequencies as well as how to prune variants for inter-marker linkage disequilibrium (LD). The principal components can be used for data quality control and can also be loaded to control for substructure and admixture when performing association testing. The results are stored in the output/pca.txt and output/titv.txt files.
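As a preview of the options described in Section 3.3, the PCA variant selection and LD pruning can be customized inside the qualityControl section; a sketch (the MAF bound of 0.05 is a hypothetical alternative to the default of 0.01, and the pruning values are the defaults from the reference chapter):
qualityControl {
  genotypes = ["DP >= 8 and GQ >= 20"]
  variants = ["missingRate <= 0.1"]
  summaries = ["pca", "titv"]
  pca {
    variants = ["informative", "missingRate <= 0.1", "maf >= 0.05 and maf <= 0.95"]
    noprune = false          # enable LD pruning
    prune {
      window = 50
      step = 5
      r2 = 0.5
    }
  }
}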
To perform association testing, we need to add another section association
to the configuration file:
seqspark {
project = demo
pipeline = [ "qualityControl", "association" ]
input {
genotype.path = "simulated.vcf.bz2"
phenotype.path = "simulated.tsv"
}
qualityControl {
genotypes = ["DP >= 8 and GQ >= 20"]
variants = ["missingRate <= 0.1"]
summaries = ["pca", "titv"]
}
association {
trait {
list = ["bmi", "disease"]
bmi {
binary = false
covariates = ["sex", "age", "disease"]
pc = 0
}
disease {
binary = true
covariates = ["sex", "age", "bmi"]
pc = 0
}
}
method {
list = [snv]
}
}
}
Although it looks lengthy at first, the structure is clear and simple. First, change the pipeline list to include association. The added association section consists of two subsections, trait and method.
trait describes the phenotypes to be analyzed and the corresponding covariates to include in the regression model. You need to specify whether the trait is binary (false denotes a quantitative trait) and how many principal components to include in the regression analysis. Here a binary trait, disease, and a quantitative trait, BMI, are analyzed.
method describes the methods to use. For single variant association analysis, snv denotes the type of analysis to be performed. Currently, the additive test is supported for single variant association analysis.
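The snv method also accepts its own options, which are detailed in Section 3.4.2; a sketch using the score test and an explicit MAF cutoff (the cutoff of 0.05 is a hypothetical value):
method {
  list = ["snv"]
  snv {
    test = score        # or wald
    type = snv
    maf {
      source = "pooled"
      cutoff = 0.05     # only test variants more common than this cutoff
    }
  }
}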
To perform rare variant aggregate tests, the methods are added to the list as shown below:
seqspark {
project = demo
pipeline = [ "qualityControl", "association" ]
input {
genotype.path = "simulated.vcf.bz2"
phenotype.path = "simulated.tsv"
}
qualityControl {
genotypes = ["DP >= 8 and GQ >= 20"]
variants = ["missingRate <= 0.1"]
summaries = ["pca", "titv"]
}
association {
trait {
list = ["bmi", "disease"]
bmi {
binary = false
covariates = ["sex", "age", "disease"]
pc = 0
}
disease {
binary = true
covariates = ["sex", "age", "bmi"]
pc = 0
}
}
method {
list = ["snv", "cmc", "brv", "vt", "skat", "skato"]
}
}
}
In the list, besides snv, five rare variant aggregate tests are now included: cmc, the Combined Multivariate Collapsing method; brv, Burden of Rare Variants; vt, Variable Threshold; skat, the Sequence Kernel Association Test; and skato, SKAT-optimal. It should be noted that for rare variant analysis, annotation using RefSeq is performed before the analysis. By default, all splice site, frameshift, missense and nonsense variants of each gene with a MAF below the cutoff are analyzed. It is possible to annotate using additional databases such as CADD and to specify which variants to analyze, e.g. by filtering on the CADD score. The user can also specify the MAF cutoff to be used in the analysis. The specifics will be covered later in the documentation.
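As a preview of those options, a rare variant method can be given its own block with an annotation-based variant filter and an explicit MAF cutoff; a sketch (the CADD filter assumes the CADD database has been installed with seqspark-db, and the parameter layout follows the method reference in Section 3.4.2):
method {
  list = ["cmc"]
  cmc {
    type = cmc
    maf {
      source = "pooled"
      cutoff = 0.01                                    # analyze variants with MAF below this cutoff
    }
    misc {
      groupBy = ["gene"]
      variants = ["isFunctional", "CADD.score >= 3"]   # requires CADD annotation
    }
  }
}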
Now that the configuration file is ready, it can be run with SEQSpark:
dolores@cluster demo$ seqspark SingleStudy demo.conf
In the current directory, an output directory will be created, where you can find:
titv.txt, summarizing the Ti/Tv ratio for each sample and for the dataset.
pca.txt, containing the values of the 10 principal components for each sample.
A series of assoc_$trait_$method files containing the results of the association analysis for each trait and method combination.
seqspark {
project = seqspark
pipeline = [ "qualityControl", "association" ]
partitions = 1024
}
project
: The name of your project
pipeline
: The pipeline to run
partitions
: The number of Spark partitions to use.
seqspark {
input {
genotype {
format = vcf
path = ${seqspark.project}.vcf
filters = []
variants = all
}
phenotype {
path = ${seqspark.project}.tsv
batch = none
}
}
}
genotype.format
: The format of the input genotype file, can be either vcf
or impute2
. When impute2
data is used, the analysis will be based on dosages for both single variant and rare variant aggregate association analysis.
genotype.path
: The path in HDFS for the genotype data file.
genotype.filters : filters applied when loading the genotype data, based on fields such as FILTER and INFO in the VCF. For example:
filters = ["FILTER == \"PASS\"", "INFO.AN > 1800"]
filters = ["INFO >= 0.8"]
genotype.variants : load variants by region. Valid values include "chr1" and "chr1:1-30000000,chr2".
phenotype.path
: The path on HDFS or local filesystem for the phenotype data file.
phenotype.batch : specifies the batch information when the genetic data was generated in batches. Some quality control can be performed by batch to investigate and/or correct for batch-specific artifacts or effects. The default is none, which means no batch information is available.
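A minimal sketch, assuming the batch value names a column in the phenotype file (the column name "batch" is hypothetical):
phenotype {
  path = "simulated.tsv"
  batch = "batch"   # hypothetical column in the phenotype file identifying the sequencing batch
}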
seqspark {
qualityControl {
genotypes = []
variants = []
summaries = []
pca {
variants = [
"informative",
"missingRate <= 0.1",
"chr != \"X\"",
"chr != \"Y\"",
"maf >= 0.01 and maf <= 0.99",
"hwePvalue >= 13-5"
]
noprune = true
prune {
window = 50
step = 5
r2 = 0.5
}
}
save = false
export = false
}
}
genotypes
: The filters for genotypes.
variants : The filters for variants. The following variables can be used in filter expressions:
chr : chromosome
maf : alternative allele frequency
informative : not a monomorphic site
missingRate : overall missing rate
batchMissingRate : the maximum batch-specific missing rate
batchSpecific : the batch-specific variants cutoff
isFunctional : the variant site is functional, i.e. splice site, missense, frameshift or nonsense
hwePvalue : the P value of the HWE test
summaries : functions to perform on the genotype data, which include:
pca : perform PCA analysis
titv : calculate the Ti/Tv ratio
pca : how to prepare data for the PCA
pca.variants : variant inclusion criteria for the PCA, for example excluding variants on the X and Y chromosomes
pca.noprune : do not prune SNPs based on LD
pca.prune : LD pruning parameters, similar to PLINK pairwise LD pruning
save : save the clean dataset to HDFS in the internal format
export : export the clean dataset to HDFS in VCF format
The association section contains two subsections, trait
and method
.
seqspark {
association {
trait {
list = [someTrait]
someTrait {
key1 = value1
}
}
method {
list = [someMethod]
someMethod {
key1 = value1
}
}
}
}
trait
contains a list
of phenotypes to analyze and the parameters for each phenotype.
someTrait {
binary = false #true
covariates = ["c1", "c2", ]
pc = 0
}
binary
: true
for binary trait and false
for quantitative trait
covariates
: a list of covariates in the phenotype file to be included in the regression model
pc
: number of principal components to be included in the regression model
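For example, a quantitative trait adjusted for sex, age, and the first four principal components computed during quality control could be configured as follows (a sketch; the covariate names and the value of pc are illustrative):
trait {
  list = ["bmi"]
  bmi {
    binary = false
    covariates = ["sex", "age"]
    pc = 4      # include the first 4 principal components in the regression model
  }
}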
method
contains a list
of methods to use and the parameters for each method.
Single variant test
snv {
test = score #wald
type = snv
resampling = false
maf {
source = "pooled"
cutoff = 0.01
}
}
For single variant association analysis, the following options can be used:
test
: Can either perform the score or Wald test
type
: must be snv
for single variant test
resampling
: generate empirical p value when true or analytical p-values when false
maf.source
: which source to use to calculate the minor allele frequency. This can be from a database such as ExAC or gnomAD
"pooled"
: using all samples"controls"
: using only controls, this is only valid when the trait is binary, otherwise pooled sample will be used. It is not recommened to use controls since it has been shown to increase type I error."some_key"
: using "some_key"
to extract value from the INFO field. The value must be a number between 0 and 1. otherwise, pooled sample will be used.maf.cutoff
: only perform analysis for variants with cutoff. Single variant association tests have very low power for rare variants.
Burden tests
burden {
test = score #wald
type = cmc #brv
resampling = false
weight = none #wss
maf = {
source = "pooled"
cutoff = 0.01
fixed = true
}
misc {
varLimit = [ 2, 2000 ]
groupBy = ["gene"]
variants = ["isFunctional"]
weightBeta = [1, 25]
}
}
The resampling and maf.source options are the same as for the single variant analysis. The maf.cutoff option specifies the frequency of the variants used in the analysis (variants with MAF below the cutoff); unlike the snv analysis, here it is of interest to analyze rare variants.
type : the coding style.
cmc : Combined Multivariate Collapsing method - uses binary coding: (1) one or more rare variants are present in the region, or (0) no rare variants
brv : Burden of Rare Variants - analyzes the number of variants within a region
weight : variant weights, which can be used for all tests except CMC
wss : weighted sum statistic style weights (Madsen-Browning), which give rarer variants larger weights
none : no weight, i.e. the weights are equal for all variants
skat : the weighting style used by SKAT, based on the Beta density of the MAF, where the two Beta parameters are 1 and 25 by default and can be specified in misc.weightBeta
"some_key" : use "some_key" from the INFO field of the VCF as the weight
maf.fixed : use a fixed MAF cutoff to perform the cmc or brv test; to perform the variable threshold (VT) test, set maf.fixed to false. The coding used for VT can be either cmc or brv. For VT, resampling is set to true by default.
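For instance, a variable threshold test with BRV coding could be configured like this (a sketch; the block name vt and the cutoff of 0.05 are illustrative):
vt {
  test = score
  type = brv          # coding style used for the variable threshold test
  weight = none
  maf {
    source = "pooled"
    cutoff = 0.05     # upper bound for the variable thresholds
    fixed = false     # false switches from a fixed cutoff to the variable threshold test
  }
}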
misc : more options for the rare variant tests
misc.varLimit : skip regions/genes with fewer variants than the lower bound or more variants than the upper bound. The default is varLimit = [ 2, 2000 ]
misc.groupBy : group variants by gene. Variants can also be grouped by ["slidingWindow", size, overlap] or by other regulatory regions, for example using annotation from ENCODE
misc.variants : only include variants that satisfy these filters. Annotations from various bioinformatic tools/databases can be used, e.g. ["CADD.score >= 3"]
misc.weightBeta : the two Beta parameters for the skat-style weight
SKAT/SKAT-O tests
skato {
type = skato #skat
weight = skat
maf {
source = pooled
cutoff = 0.01
}
misc {
varLimit = 2000
groupBy = ["gene"]
variants = ["isFunctional"]
method = "optimal.adj"
rhos = []
kernel = "linear.weighted"
weightBeta = [1, 25]
}
}
type can be skat or skato. weight and maf are similar to the burden tests, except that the maf.fixed = false setting and the test setting have no effect. In misc, the options method, rhos, kernel, and weightBeta are similar to their counterparts in the original SKAT R package. For more detail, please see the R package documentation.
Summary statistics for meta-analysis
sum {
type = sum
maf.source = pooled
misc {
varLimit = 2000
groupBy = ["slidingWindow", 1000000, 0]
variants = []
}
}
Set type = sum to generate the score statistics and the covariance matrix. The other parameters are similar to those of the association analysis methods above. By default, groupBy = ["slidingWindow", 1000000, 0] generates summary statistics for the variants in every 1 Mbp region. The summary statistics can later be used to perform single variant or rare variant aggregate meta-analysis.
In addition to performing association analysis for a single study, SEQSpark can perform meta-analysis using summary statistics generated from single study analyses.
seqspark {
project = someName
partitions = 1024
meta {
studies = ["study1", "study2"]
study1 {
format = "seqspark"
path = "hdfs:///user/dolores/data/study1.meta"
}
study2 {
format = "seqspark"
path = "hdfs://user/dolores/data/study2.meta"
}
method {
list = ["methodName"]
methodName {
key1 = value1
}
}
}
}
project
: project name, same as the single study analysis
partitions
: number of Spark partitions
meta
: parameters controlling how to perform the meta-analysis
studies
: a list of studies
someStudy
: contains format and path of the summary statistics of the study
someStudy.format : currently only the summary statistics produced by SEQSpark single study analysis are supported, but additional formats will be added
someStudy.path : the location of the summary statistics
method : parameters controlling the methods to use
method.list : a list of methods to use, e.g. SKAT, BRV, etc.
method.methodName : the parameters of the named method, e.g. method.skat. Meta-analysis is similar to single study association analysis, with one exception: because it is based on summary statistics, permutation-based analysis (e.g. VT) is not available. To run the meta-analysis, use the following command:
dolores@cluster demo$ seqspark MetaAnalysis meta.conf
0. preparation
Suppose the workstation has six spare hard drives, 16 CPU cores, and 64 GB of memory. The following is observed after formatting and mounting the hard drives separately:
dolores@workstation: ~$ df -Th
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda1 ext4 459G 114G 345G 25% /mnt/hadoop/disk1
/dev/sdb1 ext4 459G 113G 346G 25% /mnt/hadoop/disk2
/dev/sdc1 ext4 459G 113G 347G 25% /mnt/hadoop/disk3
/dev/sdd1 ext4 459G 112G 347G 25% /mnt/hadoop/disk4
/dev/sde1 ext4 459G 114G 346G 25% /mnt/hadoop/disk5
/dev/sdf1 ext4 459G 113G 346G 25% /mnt/hadoop/disk6
...
Make sure you can ssh to the localhost
without typing the password. If you cannot, set it up as follows:
dolores@workstation: ~$ ssh-keygen
dolores@workstation: ~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
dolores@workstation: ~$ chmod 600 ~/.ssh/authorized_keys
dolores@workstation: ~$ ssh localhost
dolores@workstation: ~$ exit
Download the suitable version of JDK for your system here, and then install it:
dolores@workstation: ~$ tar xf jdk-8u121-linux-x64.tar.gz
dolores@workstation: ~$ sudo cp -r jdk1.8.0_121 /opt/jdk1.8.0_121
1. install and set up Hadoop
Download and install:
dolores@workstation: ~$ wget http://mirror.olnevhost.net/pub/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
dolores@workstation: ~$ tar xf hadoop-2.6.5.tar.gz
dolores@workstation: ~$ sudo mv hadoop-2.6.5 /opt
Configure Hadoop:
dolores@workstation: ~$ cd /opt/hadoop-2.6.5/
dolores@workstation: hadoop-2.6.5$ mkdir data
dolores@workstation: hadoop-2.6.5$ cd etc/hadoop/
Edit the following files that are distributed with Hadoop as follows:
hadoop-env.sh
:
change the line export JAVA_HOME=${JAVA_HOME}
to export JAVA_HOME=/opt/jdk1.8.0_121
core-site.xml
:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop-2.6.5/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name> <value>file:///mnt/hadoop/disk1,file:///mnt/hadoop/disk2,file:///mnt/hadoop/disk3,file:///mnt/hadoop/disk4,file:///mnt/hadoop/disk5,file:///mnt/hadoop/disk6</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>33554432</value>
</property>
</configuration>
Format HDFS and start the service:
dolores@workstation: hadoop$ cd ../..
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs namenode -format
dolores@workstation: hadoop-2.6.5$ ./sbin/start-dfs.sh
Check if it is working properly:
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs dfs -mkdir -p /user/dolores
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs dfs -put etc/hadoop/core-site.xml
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs dfs -cat core-site.xml
dolores@workstation: hadoop-2.6.5$ ./bin/hdfs dfs -rm core-site.xml
Add the following to the file ~/.profile
and re-login:
export PATH=/opt/hadoop-2.6.5/bin:$PATH
2. install and configure Spark
Download and install:
dolores@workstation: ~$ wget http://mirror.cogentco.com/pub/apache/spark/spark-2.1.0/spark-2.1.0-bin-without-hadoop.tgz
dolores@workstation: ~$ tar xf spark-2.1.0-bin-without-hadoop.tgz
dolores@workstation: ~$ sudo mv spark-2.1.0-bin-without-hadoop /opt/spark-2.1.0
Configure Spark:
dolores@workstation: ~$ cd /opt/spark-2.1.0/conf
dolores@workstation: conf$ cp spark-env.sh.template spark-env.sh
dolores@workstation: conf$ cp spark-defaults.conf.template spark-defaults.conf
dolores@workstation: conf$ cp log4j.properties.template log4j.properties
Edit the following files that are distributed with Spark as follows:
spark-env.sh
:
export HADOOP_HOME=/opt/hadoop-2.6.5
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${HADOOP_HOME}/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
source ${HADOOP_CONF_DIR}/hadoop-env.sh
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
export SPARK_WORKER_MEMORY=52g
export SPARK_WORKER_CORES=16
Note: set $SPARK_WORKER_MEMORY
and $SPARK_WORKER_CORES
to a reasonable value for your server.
spark-defaults.conf
:
spark.master spark://localhost:7077
spark.executor.memory 52g
spark.driver.memory 8g
spark.driver.maxResultSize 6g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.rdd.compress true
spark.kryoserializer.buffer.max 1024m
Note: set the memory usage options to reasonable values for your server.
log4j.properties
:
add one line log4j.logger.org.apache.spark=WARN
Start Spark service:
dolores@workstation: conf$ cd /opt/spark-2.1.0
dolores@workstation: spark-2.1.0$ ./sbin/start-all.sh
Check if it is working properly:
dolores@workstation: spark-2.1.0$ ./bin/spark-shell
Add the following to the file ~/.profile
and re-login:
export PATH=/opt/spark-2.1.0/bin:$PATH
0. preparation
Suppose the cluster to be set up consists of 1 master server, node0, and 8 slave servers, node[1-8]. Each slave server has 3 spare hard drives, 16 CPU cores, and 64 GB of memory. Format and mount the hard drives separately. The following is observed on each slave server:
dolores@node1: ~$ df -Th
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda1 ext4 459G 114G 345G 25% /mnt/hadoop/disk1
/dev/sdb1 ext4 459G 113G 346G 25% /mnt/hadoop/disk2
/dev/sdc1 ext4 459G 113G 347G 25% /mnt/hadoop/disk3
Make sure you can ssh to any of the nodes without typing the password.
Download the suitable version of JDK for your system here, and then install it for all nodes as follows:
dolores@node0: ~$ tar xf jdk-8u121-linux-x64.tar.gz
dolores@node0: ~$ sudo cp -r jdk1.8.0_121 /opt/jdk1.8.0_121
1. install and set up Hadoop
For all nodes, download and install:
dolores@node0: ~$ wget http://mirror.olnevhost.net/pub/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
dolores@node0: ~$ tar xf hadoop-2.6.5.tar.gz
dolores@node0: ~$ sudo mv hadoop-2.6.5 /opt
Configure Hadoop:
dolores@node0: ~$ cd /opt/hadoop-2.6.5/
dolores@node0: hadoop-2.6.5$ mkdir data
dolores@node0: hadoop-2.6.5$ cd etc/hadoop/
For all nodes, edit the following files that are distributed with Hadoop as follows:
hadoop-env.sh
:
change the line export JAVA_HOME=${JAVA_HOME}
to export JAVA_HOME=/opt/jdk1.8.0_121
core-site.xml
:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node0:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
For node0
, edit the following files:
hdfs-site.xml
:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop-2.6.5/data/namenode</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>33554432</value>
</property>
</configuration>
slaves
:
node1
node2
node3
node4
node5
node6
node7
node8
For node[1-8]
, edit the following file:
hdfs-site.xml
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///mnt/hadoop/disk1,file:///mnt/hadoop/disk2,file:///mnt/hadoop/disk3</value>
</property>
</configuration>
Format HDFS and start the service:
dolores@node0: hadoop$ cd ../..
dolores@node0: hadoop-2.6.5$ ./bin/hdfs namenode -format
dolores@node0: hadoop-2.6.5$ ./sbin/start-dfs.sh
Check if it is working properly:
dolores@node0: hadoop-2.6.5$ ./bin/hdfs dfs -mkdir -p /user/dolores
dolores@node0: hadoop-2.6.5$ ./bin/hdfs dfs -put etc/hadoop/core-site.xml
dolores@node0: hadoop-2.6.5$ ./bin/hdfs dfs -cat core-site.xml
dolores@node0: hadoop-2.6.5$ ./bin/hdfs dfs -rm core-site.xml
Add the following to the file ~/.profile
and re-login:
export PATH=/opt/hadoop-2.6.5/bin:$PATH
2. install and configure Spark
For all nodes, download and install:
dolores@node0: ~$ wget http://mirror.cogentco.com/pub/apache/spark/spark-2.1.0/spark-2.1.0-bin-without-hadoop.tgz
dolores@node0: ~$ tar xf spark-2.1.0-bin-without-hadoop.tgz
dolores@node0: ~$ sudo mv spark-2.1.0-bin-without-hadoop /opt/spark-2.1.0
Configure Spark:
dolores@node0: ~$ cd /opt/spark-2.1.0/conf
dolores@node0: conf$ cp spark-env.sh.template spark-env.sh
dolores@node0: conf$ cp spark-defaults.conf.template spark-defaults.conf
dolores@node0: conf$ cp log4j.properties.template log4j.properties
For all nodes, edit the following files:
spark-env.sh
:
export HADOOP_HOME=/opt/hadoop-2.6.5
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${HADOOP_HOME}/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
source ${HADOOP_CONF_DIR}/hadoop-env.sh
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
export SPARK_WORKER_CORES=16
export SPARK_WORKER_MEMORY=52g
Note: set the $SPARK_WORKER_MEMORY
to a reasonable value for your server.
spark-defaults.conf
:
spark.master spark://node0:7077
spark.executor.memory 52g
spark.driver.memory 8g
spark.driver.maxResultSize 6g
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.rdd.compress true
spark.kryoserializer.buffer.max 1024m
Note: set the memory usage options to reasonable values for your server.
For node0
, edit the following files:
log4j.properties
:
add one line log4j.logger.org.apache.spark=WARN
slaves
:
node1
node2
node3
node4
node5
node6
node7
node8
Start Spark service:
dolores@node0: conf$ cd /opt/spark-2.1.0
dolores@node0: spark-2.1.0$ ./sbin/start-all.sh
Check if it is working properly:
dolores@node0: spark-2.1.0$ ./bin/spark-shell
Add the following to the file ~/.profile
and re-login:
export PATH=/opt/spark-2.1.0/bin:$PATH