Skip to content

UCSC-nanopore-cgl/margin

Repository files navigation

Margin

Margin is a suite of tools used for analysis of long-read data using Hidden Markov Models. It has two submodules: phase and polish.

MarginPhase is a utility for read haplotyping and variant phasing.

MarginPolish is a diploid-aware assembly polisher. Documentation is available here.

Quickstart

The PEPPER-Margin-DeepVariant variant calling pipeline is documented here.

Margin as a standalone tool is most easily used with a Docker image available here:

docker pull kishwars/pepper_deepvariant:latest
docker run kishwars/pepper_deepvariant:latest margin phase -h

Margin requires an indexed BAM, a reference FASTA, and a VCF as user-supplied input. The quickstart command below assumes that these source files are in your current working directory.

Margin requires selection of a parameter file, which is supplied in the docker container and in the repository. For best performance, a parameter file specific to your read data and desired output should be used:

  • to haplotag ONT data use allParams.haplotag.ont-r94g507.json
  • to haplotag PacBio HiFi data use allParams.haplotag.pb-hifi.json
  • to phase a VCF generated using ONT data use allParams.phase_vcf.ont.json
  • to phase a VCF generated using PacBio-HiFi data use: allParams.phase_vcf.pb-hifi.json
docker run \
    -v `pwd`:/data \
    kishwars/pepper_deepvariant:latest \
    margin phase \
    /data/$YOUR_ALIGNMENT_HERE.bam \
    /data/$YOUR_REFERENCE_HERE.fasta \
    /data/$YOUR_VARIANTS_HERE.vcf \
    /opt/margin_dir/params/phase/$PARAMETER_FILE \
    -t $THREAD_COUNT \
    -o /data/$OUTPUT_PREFIX

Running MarginPhase

Usage

margin phase <ALIGN_BAM> <REFERENCE_FASTA> <VARIANT_VCF> <PARAMS> [options]

Input/Output Files

Margin requires an indexed BAM, a reference FASTA, a VCF, and a parameter file. See the section below for a description of the parameterization.

Margin produces:

  • OUTPUT_PREFIX.haplotagged.bam: a BAM file with all reads tagged (HP) as 1, 2, or 0, with 0 indicating no haplotype assignment was made. If a region was specified, only reads from that region are included in the output. This output can be suppressed by using the --skipHaplotypeBAM parameter.
  • OUTPUT_PREFIX.chunks.csv: a CSV describing the boundaries of each chunk
  • OUTPUT_PREFIX.phased.vcf: a VCF with phased variants and phasesets. If a region was specified, only variants from that region are included in the output. This output can be suppressed by using the --skipPhasedVcf parameter.
  • OUTPUT_PREFIX.phaseset.bed: a BED file describing the phasesets and the reason why phasing was broken with respect to the previous phaseset. This output can be suppressed by using the --skipPhasedVcf parameter.

Parameter Files

All testing and tuning has been performed with variants called by the PEPPER-Margin-DeepVariant pipeline, although Margin will work with any variant set.

Pre-configured parameter files are provided in the code repository. The bulk of the parameters are described in the file params/base_params.json, with mode- and sequencing-specific modifications provided in the params/phase/ directory.

Params are divided into two modes: haplotag and phase_vcf. While the core algorithm determines both read haplotags and phasing of variants, there are mode-specific thresholds that should be used for best performance. The haplotag parameters are tuned to produce more phased reads and more accurate local read phasing, and were tuned using variants generated by PEPPER. The phase_vcf parameters are tuned to produce long and accurate phase sets, and were tuned using variants genotyped by DeepVariant.

Margin has parameterizations for multiple sequencing technologies. For ONT data, there are models for the R9.4 pore basecalled with Guppy 4.2.2 (ont-r94g422) and 5.0.7 (ont-r94g507), and for R10.4 Q20 data (ont-r104q20). The ont-r94g507 parameter file is recommended if the data is from a different pore or basecaller. For PacBio data, there are params for CLR and HiFi (pb-clr, pb-hifi). The pb-hifi parameter is recommended if your PacBio data has a different source. Parameter files annotated with hp are for use with PEPPER-HP, and with hapDup are for a SV calling pipeline HapDup.

Runtime Configuration

The following runtime configuration options are available:

  • -a / --logLevel: the values critical, info, and debug options are available. The default value is critical. If a user is experiencing an issue with Margin, a log file generated using the info setting is preferred. Use of debug is not recommendced.
  • -t / --threads: number of threads to use concurrently
  • -o / --outputBase: the prefix given to output files, by default "output" in the current working directory. If a directory is supplied, it must already exist. If only a directory is supplied, the output prefix will be /user/specified/directory/output.
  • -r / --region: a specific region to run in, such as chr3 or chr19:1000000-2000000. If supplied, output will only be written for data falling within this region.

High-Level Workflow

MarginPhase uses the read-based evidence of linkage between heterozygous variant sites to find the most likely assignment of reads and alleles to haplotypes. The tool first selects a set of high-quality primary reads and variants for use in the initial phasing workflow. This selection process preferentially takes longer reads and variants with high allele quality scores, up until certain parameterized thresholds are met.

The Hidden Markov Model describes read partitions at each variant site, enforces compatiblity between partitions at adjacent sites, and determines allele support with read substring alignment likelihoods. The Forward-Backward algorithm is run to determine partition likelihoods at each site, and a most-likely path through the per-site partitions is produced. The read partition described by this path is used to calculate allele support per partition, which is used to assign alleles to haplotypes. Haplotypes for primary reads are determined from this partition, after an iterative refinement step.

Using this set of high-quality phased variants and reads, haplotypes are assigned to the low-quality variants and reads. Low-quality reads are compared to the high-quality phased variants, and low-quality variants are compared to each set of haplotagged high-quality reads. The methodolgy of dividing reads and variants into high- and low-quality categories enables the use of only trusted data during the primary analysis, while still allowing a haplotype to be assigned to all input data.

Work is divided into overlapping chunks to enable multithreading. Stitching is performed using set similarity between haplotype assignments for all reads found in adjacent chunks.

Phasesets are determined after stitching has occurred. It is assumed that all phasing is consistent across a contig unless certain parameterized thresholds are not met. The haplotype assignment of the reads which were used to determine the variant's haplotype are tracked during the primary phasing algorithm and are used during this step. If one of these thresholds (described below) is not met, a new phaseset is created and the reason is described in the OUTPUT_PREFIX.phaseset.bed output file.

Parameterization

The following parameters are used to exclude reads and variants from the high-quality dataset:

  • phase.onlyUseSNPVCFEntries: toggles whether INDEL candidates are excluded
  • phase.minSnpVariantQuality: all SNP variants with a QUAL score below this are excluded
  • phase.minIndelVariantQuality: all INDEL variants with a QUAL score below this are excluded
  • polish.filterAlignmentsWithMapQBelowThisThreshold: all reads with a MQ score below this are excluded

The following parameters are used during adaptive sampling. Adaptive sampling first selects all variants above a threshold, then determines the average distance in basepairs between variants for the chunk. If the average distance is greater than a threshold (ie, if there are too few variants), then variants are selected in descending order of QUAL score until enough variants are selected or there are no more variants above the min quality threshold.

  • phase.useVariantSelectionAdaptiveSampling: whether to use adaptive sampling
  • phase.variantSelectionAdaptiveSamplingPrimaryThreshold: above this threshold, all variants are used
  • phase.variantSelectionAdaptiveSamplingDesiredBasepairsPerVariant: desired average distance between variants

The following parameters are used to select reads:

  • phase.maxDepth: reads are downsampled to an expected depth of this parameter. Longer reads are preferentially selected during downsampling.

The following parameters are used to determine if phasing is questionable or whether the current phaseset should be extended:

  • phase.phasesetMinBinomialReadSplitLikelihood: the cumulative binomial probability of the read partition at this variant (ie a partition of 10:10 has likelihood .588, and 5:15 has likelihood .021) must be above this threshold
  • phase.phasesetMaxDiscordantRatio: the maximum allowed ratio of discordant reads to all tagged reads assigned to the current and previous phased variants' haplotypes
  • phase.phasesetMinSpanningReads: the total number of phased reads spanning two adjacent phased variants

Installation

Dependencies

If compiling on Ubuntu, this will install all required packages:

sudo apt-get install git make gcc g++ autoconf zlib1g-dev libcurl4-openssl-dev libbz2-dev libhdf5-dev

Note that libhdf5-dev is required for HELEN image generation with MarginPolish but is not required.

Margin is compiled with cmake. We recommend using the latest cmake version, but 3.7 and higher are supported:

wget https://github.com/Kitware/CMake/releases/download/v3.14.4/cmake-3.14.4-Linux-x86_64.sh && sudo mkdir /opt/cmake && sudo sh cmake-3.14.4-Linux-x86_64.sh --prefix=/opt/cmake --skip-license && sudo ln -s /opt/cmake/bin/cmake /usr/local/bin/cmake
cmake --version

Compilation

# Check out the repository and submodules:
git clone https://github.com/UCSC-nanopore-cgl/margin.git
cd margin
git submodule update --init

# Make build directory:
mkdir build
cd build

# Generate Makefile and run:
cmake ..
make
./margin

Verification

# haplotagging mode
./margin phase \
    ../tests/data/realData/HG002.r94g360.chr20_59M_100k.bam \
    ../tests/data/realData/hg38.chr20_59M_100k.fa \
    ../tests/data/realData/HG002.r94g360.chr20_59M_100k.vcf \
    ../params/phase/allParams.haplotag.ont-r94g507.json \
    --skipPhasedVCF
# verification: expect 145, 137
samtools view output.haplotagged.bam | grep "HP:i:1" | wc -l
samtools view output.haplotagged.bam | grep "HP:i:2" | wc -l
                
# variant phasing mode
./margin phase \
    ../tests/data/realData/HG002.r94g360.chr20_59M_100k.bam \
    ../tests/data/realData/hg38.chr20_59M_100k.fa \
    ../tests/data/realData/HG002.r94g360.chr20_59M_100k.vcf \
    ../params/phase/allParams.phase_vcf.ont.json \
    --skipHaplotypeBAM
# verification: expect 105
cat output.phased.vcf | grep -e "1|0\|0|1" | wc -l

Resource Requirements

Margin took between 19m (35x PacBio-HiFi) and 80m (75x ONT) for a single whole genome with 64 threads and peak memory usage of 35GB during phasing.

© 2019 by Benedict Paten ([email protected]), Trevor Pesout ([email protected])