Running GATK4 Spark on Quest

How to run the Genome Analysis Toolkit's Spark tools on Quest.

The Broad Institute’s Genome Analysis Toolkit (GATK) is a widely used best-practices toolkit for whole genome sequencing analysis and variant calling. As of GATK version 4, many GATK tools can also run on Apache Spark, a unified analytics engine for large-scale data processing, which can significantly reduce computation time. Note that GATK4’s Spark tools are currently in beta. Because GATK4 Spark runs on multiple nodes, it must be launched with a job submission script and cannot be run on a login node.

GATK4 Spark Tools

To see a list of available GATK4 tools:

module load gatk/4.0.4

gatk --list

GATK Spark tools have the word “Spark” in their name, and can be explicitly listed with grep:

gatk --list | grep Spark

Additional help is available for each tool with:

gatk ToolName --help

Converting FASTA Files for Parallel Runs

Standard FASTA files do not support parallel operation and must be converted to 2bit format. To convert them, use the faToTwoBit binary in $SPARK_TOOLS, which is on the path set by the spark module.

module load spark/2.3.0

faToTwoBit exampleFASTA.fasta exampleFASTA.2bit

This step only needs to be done once.
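
Because the conversion only needs to happen once, a script can guard it so that reruns skip the step. The following is a minimal sketch, assuming faToTwoBit is on the path after loading the spark module as described above; the function name ensure_twobit is hypothetical and not part of any Quest module.

```shell
#!/bin/bash
# Hypothetical helper: convert a FASTA file to 2bit only if the
# 2bit file does not already exist (or is empty).
# Usage: ensure_twobit INPUT.fasta OUTPUT.2bit
ensure_twobit() {
    local fasta="$1" twobit="$2"

    # A non-empty 2bit file means the conversion was already done.
    if [ -s "$twobit" ]; then
        echo "skip: $twobit already exists"
        return 0
    fi

    # faToTwoBit is the converter named in this section.
    faToTwoBit "$fasta" "$twobit" && echo "converted: $fasta -> $twobit"
}
```

After module load spark/2.3.0, calling ensure_twobit exampleFASTA.fasta exampleFASTA.2bit would run the conversion the first time and print a skip message on later runs.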

Example Job Submission

Below is an example job submission file for Quest that uses a converted exampleFASTA.2bit file.

Some notes on the tuning parameters (the Spark options at the end of the gatk command in the submission script below):

  • --driver-cores= and --executor-cores= should add up to the ppn= value defined in the MSUB directives.
  • --driver-memory= and --executor-memory= must add up to less than the total memory available on a single node.
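
The two rules above can be checked with a little shell arithmetic before submitting a job. The sketch below is illustrative only: the function name check_spark_tuning is hypothetical, and the 128 GB node memory in the example call is an assumed figure, not a documented Quest value, so substitute the memory size of the nodes you are requesting.

```shell
#!/bin/bash
# Hypothetical helper (not part of GATK, Spark, or the Quest modules).
# Usage: check_spark_tuning PPN DRIVER_CORES EXECUTOR_CORES \
#                           NODE_MEM_GB DRIVER_MEM_GB EXECUTOR_MEM_GB
check_spark_tuning() {
    local ppn="$1" dcores="$2" ecores="$3"
    local node_mem="$4" dmem="$5" emem="$6"
    local status=0

    # Rule 1: driver cores + executor cores should equal ppn.
    if [ $((dcores + ecores)) -ne "$ppn" ]; then
        echo "cores: $dcores + $ecores != ppn ($ppn)"
        status=1
    fi

    # Rule 2: driver + executor memory must be less than node memory.
    if [ $((dmem + emem)) -ge "$node_mem" ]; then
        echo "memory: $dmem + $emem GB is not below node memory ($node_mem GB)"
        status=1
    fi

    [ "$status" -eq 0 ] && echo "ok"
    return "$status"
}

# Values from the example script: ppn=24, 2 + 22 cores, 12 + 110 GB.
# The 128 GB node size is an assumption made for illustration.
check_spark_tuning 24 2 22 128 12 110
```

The function prints ok and returns 0 when both rules hold; otherwise it prints the violated rule and returns 1, so it can be used as a guard in a wrapper script.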

GATKSpark_example.sh
#!/bin/bash
#MSUB -l nodes=2:ppn=24
#MSUB -N SparkTest1
#MSUB -q <queue>
#MSUB -l walltime=00:20:00
#MSUB -A <allocation>

# Move into the working directory (optional)
cd "$PBS_O_WORKDIR"

# Load environment
module load spark/2.3.0 gatk/4.0.4

# Initialize spark cluster inside job
$SPARK_TOOLS/initialize_spark.sh

# Run GATK HaplotypeCallerSpark (core and memory settings are per-node)
gatk HaplotypeCallerSpark --reference "$PBS_O_WORKDIR/exampleFASTA.2bit" \
--input "$PBS_O_WORKDIR/exampleBAM.bam" --output "$PBS_O_WORKDIR/exampleVCF.txt" \
--spark-runner SPARK \
--spark-master "spark://$(hostname -i):7077" -- --driver-cores=2 \
--driver-memory=12g --executor-cores=22 --executor-memory=110g 2>&1
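
When the ppn= value changes, the Spark options in the script above have to change with it. One way to keep them in step is to derive the executor core count from ppn. This is a minimal sketch; the function name spark_args is hypothetical, and the option names are the ones already used in the example script.

```shell
#!/bin/bash
# Hypothetical helper: print the Spark options used in the example,
# deriving executor cores from ppn so the two core counts sum to ppn.
# Usage: spark_args PPN DRIVER_CORES DRIVER_MEM EXECUTOR_MEM
spark_args() {
    local ppn="$1" dcores="$2" dmem="$3" emem="$4"
    local ecores=$((ppn - dcores))
    echo "--driver-cores=$dcores --driver-memory=$dmem --executor-cores=$ecores --executor-memory=$emem"
}

# Reproduces the options from the example script (ppn=24).
spark_args 24 2 12g 110g
```

The printed string can be placed after the -- separator of the gatk command, for example as -- $(spark_args 24 2 12g 110g), left unquoted so the shell splits it into separate options.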

Keywords: research computing, quest, gatk, spark, haplotype, apache, genomics, wgs
Doc ID: 86687
Owner: Research Computing
Group: Northwestern
Created: 2018-10-11 07:22 CST
Updated: 2018-10-11 13:36 CST
Sites: Northwestern