SMRT® Tools reference guide (v11.0)
Research use only. Not for use in diagnostic procedures.
P/N 102-278-500 Version 01 (April 2022)
© 2022, PacBio. All rights reserved.
Information in this document is subject to change without notice. PacBio assumes no responsibility for any errors or
omissions in this document.
Certain notices, terms, conditions and/or use restrictions may pertain to your use of PacBio products and/or third
party products. Refer to the applicable PacBio terms and conditions of sale and to the applicable license terms at
http://www.pacificbiosciences.com/licenses.html.
Trademarks:
Pacific Biosciences, the PacBio logo, PacBio, Circulomics, Omniome, SMRT, SMRTbell, Iso-Seq, Sequel, Nanobind,
and SBB are trademarks of Pacific Biosciences of California Inc. (PacBio). All other trademarks are the sole property
of their respective owners.
See https://github.com/broadinstitute/cromwell/blob/develop/LICENSE.txt for Cromwell redistribution information.
PacBio
1305 O’Brien Drive
Menlo Park, CA 94025
www.pacb.com
Introduction
This document describes the command-line tools included with SMRT® Link v11.0. These tools are for use by bioinformaticians working with secondary analysis results.
The command-line tools are located in the $SMRT_ROOT/smrtlink/smrtcmds/bin subdirectory.
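For interactive use, you can add this directory to your PATH. A minimal sketch, assuming $SMRT_ROOT points at your installation root:
export PATH=$SMRT_ROOT/smrtlink/smrtcmds/bin:$PATH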
Installation
The command-line tools are installed as an integral component of the
SMRT Link software. For installation details, see SMRT Link software
installation guide (v11.0).
Note: SMRT Link v11.0 is for use with Sequel® II systems and Sequel IIe systems only.
To install only the command-line tools, use the --smrttools-only option with the installation command, whether for a new installation or an upgrade. Examples:
smrtlink-*.run --rootdir smrtlink --smrttools-only
smrtlink-*.run --rootdir smrtlink --smrttools-only --upgrade
Supported chemistry
SMRT Link v11.0 supports all chemistry versions for Sequel II systems.
PacBio command-line tools
Following is information on the PacBio-supplied command-line tools
included in the installation. For information on third-party tools installed, see
“Appendix B - Third party command-line tools” on page 113.
Tool Description
bam2fasta/bam2fastq Converts PacBio® BAM files into gzipped FASTA and FASTQ files. See "bam2fasta/bam2fastq" on page 3.
bamsieve Generates a subset of a BAM or PacBio Data Set file based on either a list of hole
numbers, or a percentage of reads to be randomly selected.
See “bamsieve” on page 4.
ccs Calculates consensus sequences from multiple "passes" around a circularized single DNA molecule (SMRTbell® template). See "ccs" on page 7.
dataset Creates, opens, manipulates and writes Data Set XML files.
See “dataset” on page 16.
Demultiplex Barcodes Identifies barcode sequences in PacBio single-molecule sequencing data. See "Demultiplex Barcodes" on page 23.
export-datasets Takes one or more PacBio dataset XML files and packages all contents into a
single ZIP archive. See “export-datasets” on page 35.
export-job Takes one SMRT Link Analysis job and packages all contents into a single ZIP
archive. See “export-job” on page 36.
gcpp Variant-calling tool which provides several variant-calling algorithms for PacBio
sequencing data. See “gcpp” on page 38.
Genome Assembly Generates de novo assemblies using HiFi reads.
See “Genome Assembly” on page 41.
HiFiViral SARS-CoV-2 Analysis Analyzes multiplexed viral surveillance samples for SARS-CoV-2. See "HiFiViral SARS-CoV-2 Analysis" on page 49.
ipdSummary Detects DNA base-modifications from kinetic signatures.
See “ipdSummary” on page 52.
isoseq3 Characterizes full-length transcripts and generates full-length transcript isoforms,
eliminating the need for computational reconstruction. See “isoseq3” on page 56.
juliet A general-purpose minor variant caller that identifies and phases minor single
nucleotide substitution variants in complex populations. See “juliet” on page 60.
Microbial Genome Analysis Generates de novo assemblies of small prokaryotic genomes between 1.9–10 Mb and companion plasmids between 2–220 kb, and performs base modification detection, using HiFi reads. See "Microbial Genome Analysis" on page 68.
motifMaker Identifies motifs associated with DNA modifications in prokaryotic genomes.
See “motifMaker” on page 73.
pbcromwell PacBio’s wrapper for the cromwell scientific workflow engine used to power
SMRT Link. For details on how to use pbcromwell to run workflows, see
“pbcromwell” on page 75.
pbindex Creates an index file that enables random access to PacBio-specific data in BAM
files. See “pbindex” on page 80.
pbmarkdup Marks or removes duplicate reads from CCS reads.
See “pbmarkdup” on page 81.
pbmm2 Aligns PacBio reads to reference sequences; a SMRT wrapper for minimap2.
See “pbmm2” on page 84.
pbservice Performs a variety of useful tasks within SMRT Link. See “pbservice” on page 91.
pbsv Structural variant caller for PacBio reads. See “pbsv” on page 96.
pbvalidate Validates that files produced by PacBio software are compliant with PacBio’s own
internal specifications. See “pbvalidate” on page 100.
primrose Performs motif-calling to detect 5mC CpG sites in HiFi BAM files. See "primrose" on page 103 for details.
runqc-reports Generates Run QC reports. See “runqc-reports” on page 104.
summarizeModifications Generates a GFF summary file from the output of base modification analysis combined with the coverage summary GFF generated by resequencing pipelines. See "summarizeModifications" on page 105.
bam2fasta/bam2fastq
The bam2fasta and bam2fastq tools convert PacBio BAM or Data Set files into gzipped FASTA and FASTQ files, including demultiplexing of barcoded data.
Usage
Both tools have an identical interface and take BAM and/or Data Set files
as input.
bam2fasta [options] <input>
bam2fastq [options] <input>
Examples
bam2fasta -o projectName m54008_160330_053509.subreads.bam
bam2fastq -o myEcoliRuns m54008_160330_053509.subreads.bam m54008_160331_235636.subreads.bam
bam2fasta -o myHumanGenome m54012_160401_000001.subreadset.xml
Input files
• One or more *.bam files
• *.subreadset.xml file (Data Set file)
Output files
• *.fasta.gz
• *.fastq.gz
Options Description
-h, --help Displays help information and exits.
--version Displays program version number and exits.
-o, --output Specifies the prefix of the output file names. - implies streaming.
Note: Streaming is not supported with compression or with the --split-barcodes option.
-c Specifies the Gzip compression level. Values are [1,2,3,4,5,6,7,8,9].
(Default = 1)
-u Specifies that the output FASTA/FASTQ files are not compressed. .gz is not
added to the output file names, and -c settings are ignored.
--split-barcodes Splits the output into multiple FASTA/FASTQ files, by barcode pairs.
Note: The bam2fasta/bam2fastq tools inspect the bc tag in the BAM file to
determine the 0-based barcode indices from their respective positions in the
barcode FASTA file.
-p, --seqid-prefix Specifies the prefix for the sequence IDs used in the FASTA/FASTQ file headers.
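Two further hedged examples based on the options above (input file names are placeholders): streaming uncompressed FASTA to stdout with -o -, and splitting barcoded output into one file per barcode pair:
bam2fasta -u -o - m54008_160330_053509.subreads.bam > reads.fasta
bam2fastq -o demuxed --split-barcodes barcoded.subreadset.xml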
bamsieve
The bamsieve tool creates a subset of a BAM or PacBio Data Set file
based on either a list of hole numbers to include or exclude, or a
percentage of reads to be randomly selected, while keeping all subreads
within a read together. Although
bamsieve is BAM-centric, it has some
support for dataset XML and will propagate metadata, as well as scraps
BAM files in the special case of SubreadSets.
bamsieve is useful for
generating minimal test Data Sets containing a handful of reads.
bamsieve operates in two modes: list mode, where the ZMWs to keep or discard are explicitly specified, and percentage/count mode, where a fraction of the ZMWs is randomly selected.
ZMWs may be listed in one of several ways:
• As a comma-separated list on the command line.
• As a flat text file, one ZMW per line.
• As another PacBio BAM or Data Set of any type.
Usage
bamsieve [-h] [--version] [--log-file LOG_FILE]
         [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL} | --debug | --quiet | -v]
         [--show-zmws] [--include INCLUDE LIST] [--exclude EXCLUDE LIST]
         [--percentage PERCENTAGE] [-n COUNT] [-s SEED]
         [--ignore-metadata] [--barcodes]
         input_bam [output_bam]
Required Description
input_bam The name of the input BAM file or Data Set from which reads will be read.
output_bam The name of the output BAM file or Data Set where filtered reads will be written. (Default = None)
Options Description
-h, --help Displays help information and exits.
--version Displays program version number and exits.
--log-file LOG_FILE Writes the log to file. (Default = None, writes to stdout.)
--log-level Specifies the log level; values are [DEBUG, INFO, WARNING, ERROR, CRITICAL].
(Default = WARNING)
--debug Alias for setting the log level to DEBUG. (Default = False)
--quiet Alias for setting the log level to CRITICAL to suppress output. (Default = False)
-v, --verbose Sets the verbosity level. (Default = NONE)
--show-zmws Prints a list of ZMWs and exits. (Default = False)
--include INCLUDE LIST Specifies the ZMWs to include in the output. This can be a comma-separated list of
ZMWs, or a file containing a list of ZMWs (one hole number per line), or a BAM/
Data Set file. (Default = NONE)
--exclude EXCLUDE LIST Specifies the ZMWs to exclude from the output. This can be a comma-separated list of ZMWs, or a file containing a list of ZMWs (one hole number per line), or a BAM/Data Set file that specifies ZMWs. (Default = NONE)
--percentage PERCENTAGE Specifies a percentage of a SMRT® Cell to recover (Range = 1-100) rather than a specific list of reads. (Default = NONE)
-n COUNT, --count COUNT Specifies a specific number of ZMWs picked at random to recover. (Default = NONE)
-s SEED, --seed SEED Specifies a random seed for selecting a percentage of reads. (Default = NONE)
--ignore-metadata Discards the input Data Set metadata. (Default = False)
--barcodes Specifies that the include/exclude list contains barcode indices instead of ZMW numbers. (Default = False)
Examples
Pulling out two ZMWs from a BAM file:
bamsieve --include 111111,222222 full.subreads.bam sample.subreads.bam
Pulling out two ZMWs from a Data Set file:
bamsieve --include 111111,222222 full.subreadset.xml sample.subreadset.xml
Using a text list:
bamsieve --include zmws.txt full.subreads.bam sample.subreads.bam
Using another BAM or Data Set as a list:
bamsieve --include mapped.alignmentset.xml full.subreads.bam mappable.subreads.bam
Generating a list of ZMWs from a Data Set:
bamsieve --show-zmws mapped.alignmentset.xml > mapped_zmws.txt
Anonymizing a Data Set:
bamsieve --include zmws.txt --ignore-metadata --anonymize full.subreadset.xml anonymous_sample.subreadset.xml
Removing a read:
bamsieve --exclude 111111 full.subreadset.xml filtered.subreadset.xml
Selecting 0.1% of reads:
bamsieve --percentage 0.1 full.subreads.bam random_sample.subreads.bam
Selecting a different 0.1% of reads:
bamsieve --percentage 0.1 --seed 98765 full.subreads.bam random_sample.subreads.bam
Selecting just two ZMWs/reads at random:
bamsieve --count 2 full.subreads.bam two_reads.subreads.bam
Selecting by barcode:
bamsieve --barcodes --include 4,7 full.subreads.bam two_barcodes.subreads.bam
Generating a tiny BAM file that contains only mappable reads:
bamsieve --include mapped.subreads.bam full.subreads.bam mappable.subreads.bam
bamsieve --count 4 mappable.subreads.bam tiny.subreads.bam
Splitting a Data Set into two halves:
bamsieve --percentage 50 full.subreadset.xml split.1of2.subreadset.xml
bamsieve --exclude split.1of2.subreadset.xml full.subreadset.xml split.2of2.subreadset.xml
Extracting unmapped reads:
bamsieve --exclude mapped.alignmentset.xml movie.subreadset.xml unmapped.subreadset.xml
ccs
Circular Consensus Sequencing (CCS) Analysis computes consensus sequences from multiple "passes" around a circularized single DNA molecule (SMRTbell® template). CCS analysis uses the Arrow framework to achieve optimal consensus results given the number of passes available.
CCS analysis workflow
1. Initial filtering
• Filter ZMWs: Remove ZMWs with signal-to-noise ratio (SNR) below --min-snr.
• Filter subreads: Remove subreads with lengths <50% or >200% of the median subread length. Stop if the number of full-length subreads is fewer than --min-passes.
2. Generate draft
The polish stage iteratively improves upon a candidate template
sequence. Because polishing is very compute-intensive, it is
desirable to start with a template that is as close as possible to the
true sequence of the molecule to reduce the number of iterations
until convergence. The
ccs software does not pick a full-length
subread as the initial template to be polished, but instead generates
an approximate draft consensus sequence using our improved
implementation of the
Sparc graph consensus algorithm. This
algorithm depends on a subread-to-backbone alignment that is
generated by the
pancake mapper developed by PacBio, using
edlib as the core aligner. Typically, subreads have accuracy of
around 90% and the draft consensus has a higher accuracy, but is
still below 99%.
Stop if the draft length is shorter than --min-length or longer than --max-length.
3. Alignment
Align subreads to the draft consensus using
pancake with KSW2 for
downstream windowing and filtering.
4. Windowing
Divide the subread-to-draft alignment into overlapping windows with
a target size of 22 bp with ±2 bp overlap. Avoid breaking windows at
simple repeats (homopolymers to 4-mer repeats) to reduce edge
cases at window borders. Windowing reduces the algorithm run
time from quadratic to linear in the insert size.
5. Single-strand artifacts
Identify heteroduplexes, where one strand of the SMRTbell differs
significantly from the reverse complement of the other strand.
Subread orientation is inferred from the alignment. Small
heteroduplexes, such as single base
A paired with a matching G, are
retained and the ambiguity is reflected in base quality. Molecules
with a single difference longer than 20 bp between the strands are
removed and recorded as heteroduplexes in the
<outputPrefix>.ccs_report.txt file.
6. Trim large insertions
Insertions in the subreads relative to the draft that are longer than
--max-insertion-size (default 30 bp) are trimmed, since they typically represent spurious sequencing activity.
7. Filter candidates
For each window, a heuristic is used to find those positions that
likely need polishing. In addition, homopolymers are always
polished. Skipping unambiguous positions makes the polishing at
least twice as fast.
8. Polishing
The core polishing uses the arrow algorithm, a left-right Hidden Markov Model (HMM) that models the enzymatic and photophysical events in SMRT sequencing. Emission and transition parameters are estimated for each dinucleotide template context. Transition
parameters form a homogeneous Markov chain. The transition
parameters do not depend on the position within the template, only
the pulse width of a base call, the dinucleotide context of the
template, and the SNR of the ZMW.
Arrow computes the log-likelihood that the subread originates from the template,
marginalizing over all possible alignments of the subread to the
template. For every position in the template that is a candidate for
polishing,
arrow tests if the log-likelihood is improved by
substituting one of the other three nucleotides, inserting one of the
four nucleotides after the position, or deleting the position itself.
Once
arrow does not find any further beneficial mutations to the
template in an iteration, it stops.
9. QV calculation
The log-likelihood ratio between the most likely template sequence
and all of its mutated counterparts is used to calculate a quality for
each base in the final consensus. The average of the per-base
qualities is the read accuracy,
rq.
10. Final steps
The per-window consensus template sequences and base qualities
are concatenated and overhangs (overlaps between adjacent
windows) are trimmed. If the predicted read accuracy is at least
--min-rq, then the consensus read is written to the output.
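For example, a run that keeps only reads with at least 5 full passes and a predicted accuracy of at least Q30 (a hedged sketch using the filters described above; file names are placeholders):
ccs --min-passes 5 --min-rq 0.999 myData.subreads.bam myResult.bam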
Input files
• One .subreads.bam file containing the subreads for each SMRTbell® template sequenced.
Output files
A BAM file with one entry for each consensus sequence derived from a
productive ZMW. BAM is a general file format for storing sequence
data, which is described fully by the SAM/BAM working group. The
CCS analysis output format is a version of this general format, where
the consensus sequence is represented by the "Query Sequence".
Several tags were added to provide additional meta information. An
example BAM entry for a consensus as seen by
samtools is shown
below.
m64009_201008_223950/1/ccs 4 * 0 255 * * 0 0 ATCGCCTACC
~|~t~R~~r~ RG:Z:a773c1f2 fi:B:C,26,60,21,41,33,26,63,45,73,33 fn:i:6
fp:B:C,11,18,21,35,8,18,31,8,23,11 np:i:12
ri:B:C,17,37,24,4,70,21,12,44,21,32 rn:i:6 rp:B:C,16,56,17,9,23,19,10,10,23,12
rq:f:0.999651 sn:B:f,10.999,16.2603,3.964,7.17746 we:i:9816064 ws:i:20
zm:i:1
Following are some of the common fields contained in the output BAM
file:
Field Description
Query Name Movie Name / ZMW # /ccs
FLAG Required by the format but meaningless in this context. Always set to 4 to indicate the read is unmapped.
Reference Name Required by the format but meaningless in this context. Always set to *.
Mapping Start Required by the format but meaningless in this context. Always set to 0.
Mapping Quality Required by the format but meaningless in this context. Always set to 255.
CIGAR Required by the format but meaningless in this context. Always set to *.
RNEXT Required by the format but meaningless in this context. Always set to *.
PNEXT Required by the format but meaningless in this context. Always set to 0.
TLEN Required by the format but meaningless in this context. Always set to 0.
Consensus Sequence The consensus sequence generated.
Quality Values The per-base parametric quality metric. For details see “Interpreting QUAL values” on
page 12.
RG Tag The read group identifier.
bc Tag A 2-entry array of upstream-provided barcode calls for this ZMW.
bq Tag The quality of the barcode call. (Optional: Depends on barcoded inputs.)
np Tag The number of full passes that went into the subread. (Optional: Depends on barcoded inputs.)
rq Tag The predicted read quality.
zm Tag The ZMW hole number.
ma Tag Bitmask storing whether a SMRTbell adapter is missing on either side of the molecule that produced a CCS read. 0 indicates that neither end has a confirmed missing adapter.
ac Tag An array containing the counts of detected and missing SMRTbell adapters on either side of the molecule that produced a CCS read:
• Detected adapters on left/start
• Missing adapters on left/start
• Detected adapters on right/end
• Missing adapters on right/end
Usage
ccs [OPTIONS] INPUT OUTPUT
Example
ccs --all myData.subreads.bam myResult.bam
Required Description
Input File Name The name of a single subreads.bam or a subreadset.xml file to be processed.
(Example = myData.subreads.bam)
Output File Name The name of the output BAM file; comes after all other options listed. Valid output
files are the BAM and the Dataset .xml formats. (Example = myResult.bam)
Options Description
--version Prints the version number.
--report-file Contains a result tally of the outcomes for all ZMWs that were processed. If no file
name is given, the report is output to the file ccs_report.txt. In addition to the
count of successfully-produced consensus sequences, this file lists how many
ZMWs failed various data quality filters (SNR too low, not enough full passes, and
so on) and is useful for diagnosing unexpected drops in yield.
--min-snr Removes data that is likely to contain deletions. SNR is a measure of the strength
of signal for all 4 channels (A, C, G, T) used to detect base pair incorporation. This
value sets the threshold for minimum required SNR for any of the four channels.
Data with SNR < 2.5 is typically considered lower quality. (Default = 2.5)
--min-length Specifies the minimum length of the draft consensus for it to be used for further polishing. If the targeted template is known to be a particular size range, this can filter out alternative DNA templates. (Default = 10)
--max-length Specifies the maximum length of the draft consensus for it to be used for further polishing. For robust results while avoiding unnecessary computation on unusual data, set to ~20% above the largest expected insert size. (Default = 50000)
--min-passes Specifies the minimum number of passes for a ZMW to be emitted. This is the
number of full passes. Full passes must have an adapter hit before and after the
insert sequence and so do not include any partial passes at the start and end of the
sequencing reaction. It is computed as the number of passes mode across all
windows. (Default = 3)
--min-rq Specifies the minimum predicted accuracy of a read. CCS analysis generates an
accuracy prediction for each read, defined as the expected percentage of matches
in an alignment of the consensus sequence to the true read. A value of 0.99
indicates that only reads expected to be 99% accurate are emitted.
(Default = 0.99)
--num-threads Specifies how many threads to use while processing. By default, CCS analysis uses
as many threads as there are available cores to minimize processing time, but
fewer threads can be specified here.
--log-file The name of a log file to use. If none is given, the logging information is printed to
STDERR. (Example: mylog.txt)
--log-level Specifies the verbosity of log data to produce. By setting --log-level DEBUG, you can obtain detailed information on which ZMWs were dropped during processing, as well as any errors which may have appeared. (Default = INFO)
--skip-polish After constructing the draft consensus, do not proceed with the polishing steps.
This is significantly faster, but generates less accurate data with no RQ or QUAL
values associated with each base.
--by-strand Separately generates a consensus sequence from the forward and reverse strands.
Useful for identifying heteroduplexes formed during sample preparation.
--chunk Operates on a single chunk. Format i/N, where i in [1,N]. Examples: 3/24 or 9/9.
--max-chunks Determines the maximum number of chunks, given an input file.
--model-path Specifies the path to a model file or directory containing model files.
--model-spec Specifies the name of the chemistry or model to use, overriding the default
selection.
--all Generates one representative sequence per ZMW, irrespective of quality and
passes. --min-passes 0 --min-rq 0 --max-length 0 are set implicitly and
cannot be changed; --all also deactivates the maximum draft length filter.
Filtering has to be performed downstream.
The ccs --all option changes the workflow as follows:
1. There is special behavior for low-pass ZMWs. If a ZMW has fewer than 2 full-length subreads, use the subread of median length as representative
consensus, optionally with its kinetic information as forward orientation using
--all-kinetics, and do not polish.
2. Only polish ZMWs with at least two full-length subreads mapping back to the draft. Otherwise, set the predicted accuracy rq tag to -1 to indicate that the predicted accuracy was not calculated, and populate per-base QVs with "+" (QV 10), the approximate raw accuracy. Kinetic information is not available for unpolished drafts.
3. Instead of using an unpolished draft without kinetic information as a
representative consensus sequence, if --subread-fallback is used, fall
back to a representative subread with kinetic information.
How is --all different from explicitly setting --min-passes 0 --min-rq 0?
Setting --min-passes 0 --min-rq 0 is a brute-force combination that
polishes every ZMW, even those that only have one partial subread, with
polishing making no difference. In contrast, --all is a bit smarter and only
polishes ZMWs with at least one full-length subread and one additional partial
subread.
--hifi-kinetics Generates averaged kinetic information for polished reads, independently for both strands of the insert. Forward is defined with respect to the orientation represented in SEQ and is considered to be the native orientation. As with other PacBio-specific tags, aligners will not re-orient these fields.
Base modifications can be inferred from per-base pulse width (PW) and inter-pulse duration (IPD) kinetics.
Minor cases exist where a certain orientation may get filtered out entirely from a ZMW, preventing valid values from being passed for that record. In these cases, empty lists are passed for the respective record/orientation, and the number of passes is set to zero.
To facilitate the use of HiFi reads with base modification workflows, we added an executable in pbbam called ccs-kinetics-bystrandify, which creates a pseudo --by-strand BAM with corresponding pw and ip tags that imitates a normal, unaligned subreads BAM.
--all-kinetics Adds kinetic information for all ZMWs, except for unpolished draft consensus.
--subread-fallback When combined with --all, uses a subread instead of a draft as representative consensus.
--suppress-reports Suppresses the generation of default reports and metric files.
Interpreting QUAL values
The QUAL value of a read is a measure of the posterior likelihood of an
error at a particular position. Increasing QUAL values are associated with
a decreasing probability of error. For indels and homopolymers, there is
ambiguity as to which QUAL value is associated with the error probability.
Shown below are different types of alignment errors, with an
* indicating
which sequence BP should be associated with the alignment error.
Mismatch
*
ccs: ACGTATA
ref: ACATATA
Deletion
*
ccs: AC-TATA
ref: ACATATA
Insertion
*
ccs: ACGTATA
ref: AC-TATA
Homopolymer insertion or deletion
Indels should always be left-aligned, and the error probability is only given
for the first base in a homopolymer.
* *
ccs: ACGGGGTATA ccs: AC-GGGTATA
ref: AC-GGGTATA ref: ACGGGGTATA
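QUAL values use the standard Phred scale, QV = -10*log10(p(error)); for example, a consensus base with QV 20 has an estimated error probability of 1%, and QV 30 corresponds to 0.1%.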
CCS Analysis Yield report
The CCS Analysis Yield report specifies the number of ZMWs that
successfully produced consensus sequences, as well as a count of how
many ZMWs did not produce a consensus sequence for various reasons.
The entries in this report, as well as parameters used to increase or
decrease the number of ZMWs that pass various filters, are shown in the
table below.
The first part is a summary of inputs and outputs:
ZMW results Parameters affecting results Description
ZMWs input None The number of input ZMWs.
ZMWs pass filters All custom processing settings The number of CCS reads successfully produced on
the first attempt, using the fast windowed approach.
ZMWs fail filters All custom processing settings The number of ZMWs that failed to produce a CCS read.
ZMWs shortcut filters --all The number of ZMWs having fewer than 2 full-length subreads.
ZMWs with tandem repeats --min-tandem-repeat-length The number of ZMWs with repeats larger than the specified threshold.
The second part explains in detail the exclusive ZMW count for those ZMWs that were filtered:
ZMW results Parameters affecting results Description
No usable subreads None The ZMW had no usable subreads. Either there were no
subreads, or all subreads had lengths outside the range
<50% or >200% of the median subread length.
Below SNR threshold --min-snr The ZMW had at least one channel's SNR below the
minimum threshold.
Lacking full passes --min-passes There were not enough subreads that had an adapter at
the start and end of the subread (a "full pass").
Heteroduplexes None The SMRTbell contains a heteroduplex. In this case, it is
not clear what the consensus should be and so the ZMW is
dropped.
Min coverage violation None The ZMW is damaged on one strand and cannot be
polished reliably.
Draft generation error None Subreads do not match the generated draft sequence,
even after multiple tries.
Draft above --max-length --max-length The draft sequence was above the maximum length threshold.
Draft below --min-length --min-length The draft sequence was below the minimum length threshold.
Lacking usable subreads None Too many subreads were dropped while polishing.
CCS analysis did not converge None The consensus sequence did not converge after the maximum number of allowed rounds of polishing.
CCS read below minimum predicted accuracy --min-rq Each CCS read has a predicted level of accuracy associated with it. Reads that are below the minimum specified threshold are removed.
Unknown error during processing None These should not occur.
How do I read the ccs_report.txt file?
By default, each CCS analysis generates a ccs_report.txt file. This file summarizes how many ZMWs generated HiFi reads and how many ZMWs failed CCS read generation for each of the listed causes. Each failing ZMW contributes to exactly one failure reason; percentages are with respect to the number of failed ZMWs.
Does CCS analysis dislike low-complexity regions?
Low-complexity comes in many shapes and forms. A particular challenge for CCS analysis is highly-enriched tandem repeats, such as hundreds of copies of AGGGGT. Prior to ccs v5.0, inserts with many copies of a small repeat likely did not generate a consensus sequence. Since ccs v5.0, every ZMW is tested for a tandem repeat of at least the length specified by --min-tandem-repeat-length (Default = 1000). For this, we use symmetric DUST, specifically a slightly modified sdust implementation. If a ZMW is flagged as a tandem repeat, --disable-heuristics is internally activated for only this ZMW, and various filters that are known to exclude low-complexity sequences are disabled. This recovers most of the low-complexity consensus sequences without impacting run-time performance.
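For example, to flag shorter tandem repeats than the default (a hedged sketch; the threshold is illustrative):
ccs --min-tandem-repeat-length 500 myData.subreads.bam myResult.bam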
Can I produce one consensus sequence for each strand of a molecule?
Yes, use --by-strand. Make sure that you have sufficient coverage, as --min-passes is applied per strand in this case. For each strand, CCS analysis generates one consensus read that has to pass all filters. A read name suffix indicates the strand. Example:
m64011_190714_120746/14/ccs/rev
m64011_190714_120746/35/ccs/fwd
How does --by-strand work? For each ZMW:
• Determine the orientation of reads with respect to the one closest to the median length.
• Sort reads into two buckets, forward and reverse strands.
• Treat each strand as an individual entity as we do with ZMWs:
– Apply all filters per strand individually.
– Create a draft for each strand.
– Polish each strand.
– Write out each polished strand consensus.
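A minimal by-strand invocation (a hedged sketch; file names are placeholders):
ccs --by-strand myData.subreads.bam byStrand.bam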
BAM tags generated
Tag Type Description
ec f Effective coverage
fi B,C Forward IPD (Codec V1)
fn i Forward number of complete passes (zero or more)
fp B,C Forward PulseWidth (Codec V1)
np i Number of full-length subreads
ri B,C Reverse IPD (Codec V1)
rn i Reverse number of complete passes (zero or more)
rp B,C Reverse PulseWidth (Codec V1)
rq f Predicted average read accuracy
sn B,f Signal-to-noise ratios for each nucleotide
zm i ZMW hole number
RG Z Read group
dataset
The dataset tool creates, opens, manipulates and writes Data Set XML
files. The commands allow you to perform operations on the various types
of data held by a Data Set XML: Merge, split, write, and so on.
Usage
dataset [-h] [--version] [--log-file LOG_FILE]
        [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL} | --debug | --quiet | -v]
        [--strict] [--skipCounts]
        {create,filter,merge,split,validate,summarize,consolidate,loadstats,newuuid,loadmetadata,copyto,absolutize,relativize}
Options Description
-h, --help Displays help information and exits.
<Command> -h Displays help for a specific command.
-v, --version Displays program version number and exits.
--log-file LOG_FILE Writes the log to file. (Default = None, writes to stdout.)
--log-level Specifies the log level; values are [DEBUG, INFO, WARNING, ERROR, CRITICAL]. (Default = INFO)
--debug Alias for setting the log level to DEBUG. (Default = False)
--quiet Alias for setting the log level to CRITICAL to suppress output. (Default = False)
-v Sets the verbosity level. (Default = NONE)
--strict Turns on strict tests and displays all errors. (Default = False)
--skipCounts Skips updating NumRecords and TotalLength counts. (Default = False)
create
Command: Create an XML file from a fofn (file-of-file names) or BAM file. Possible types: SubreadSet, AlignmentSet, ReferenceSet, HdfSubreadSet, BarcodeSet, ConsensusAlignmentSet, ConsensusReadSet, ContigSet.
dataset create [-h] [--type DSTYPE] [--name DSNAME] [--generateIndices]
[--metadata METADATA] [--novalidate] [--relative]
outfile infile [infile ...]
Example
The following example shows how to use the dataset create command
to create a barcode file:
dataset create --generateIndices --name my_barcodes --type BarcodeSet
my_barcodes.barcodeset.xml my_barcodes.fasta
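Another hedged example (file names are placeholders): creating a SubreadSet from two movies' BAM files, with index generation:
dataset create --generateIndices --name my_sample --type SubreadSet my_sample.subreadset.xml movie1.subreads.bam movie2.subreads.bam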
Required Description
outfile The name of the XML file to create.
infile The fofn (file-of-file-names) or BAM file(s) to convert into an XML file.
Options Description
--type DSTYPE Specifies the type of XML file to create. (Default = NONE)
--name DSNAME The name of the new Data Set XML file.
--generateIndices Generates index files (.pbi and .bai for BAM, .fai for FASTA). Requires samtools/pysam and pbindex. (Default = FALSE)
--metadata METADATA A metadata.xml file (or Data Set XML) to supply metadata. (Default = NONE)
--novalidate Specifies not to validate the resulting XML. Leaves the paths as they are.
--relative Makes the included paths relative instead of absolute. This is not compatible with --novalidate.
filter
Command: Filter an XML file using filters and threshold values. Suggested filters: alignedlength, as, astart, bc, bcf, bcq, bcr, bq, cx, length, mapqv, movie, n_subreads, pos, qend, qid, qname, qstart, readstart, rname, rq, tend, tstart, zm.
More resource-intensive filter: [qs]
Note: Multiple filters with different names are ANDed together. Multiple filters with the same name are ORed together, duplicating existing requirements. The filter string should be enclosed in single quotes.
dataset filter [-h] infile outfile filters [filters ...]
Examples
Filter on read quality > 0.99 (Q20):
% dataset filter in.consensusreadset.xml hifi.consensusreadset.xml 'rq >= 0.99'
Filter on read quality and length:
% dataset filter in.consensusreadset.xml filtered.consensusreadset.xml 'rq >= 0.99 AND length >= 10000'
Filter for very long and very short reads:
% dataset filter in.consensusreadset.xml filtered.consensusreadset.xml 'length >= 40000; length <= 400'
Filter for specific high-quality barcodes:
% dataset filter mixed.consensusreadset.xml samples1-3.consensusreadset.xml 'bc = [0,1,2] AND bq >= 26'
Required Description
infile The name of the XML file to filter.
outfile The name of the output filtered XML file.
filters The values to filter on. (Example: rq>0.85)
merge
Command: Combine XML files.
dataset merge [-h] outfile infiles [infiles ...]
split
Command: Split a Data Set XML file.
dataset split [-h] [--contigs] [--barcodes] [--zmws] [--byRefLength]
              [--noCounts] [--chunks CHUNKS] [--maxChunks MAXCHUNKS]
              [--targetSize TARGETSIZE] [--breakContigs]
              [--subdatasets] [--outdir OUTDIR]
              infile [outfiles ...]
validate
Command: Validate XML and ResourceId files. (This is an internal testing functionality that may be useful.)
Note: This command requires that pyxb (not distributed with SMRT Link) be installed. If not installed, validate simply checks that the files pointed to in ResourceIds exist.
Required Description
infiles The names of the XML files to merge.
outfile The name of the output XML file.
Required Description
infile The name of the XML file to split.
Options Description
outfiles The names of the resulting XML files.
--contigs Splits the XML file based on contigs. (Default = FALSE)
--barcodes Splits the XML file based on barcodes. (Default = FALSE)
--zmws Splits the XML file based on ZMWs. (Default = FALSE)
--byRefLength Splits contigs by contig length. (Default = TRUE)
--noCounts Skips updating the Data Set counts after the split. (Default = FALSE)
--chunks x Splits contigs into x total windows. (Default = 0)
--maxChunks x Splits the contig list into at most x groups. (Default = 0)
--targetSize x Specifies the minimum number of records per chunk. (Default = 5000)
--breakContigs Breaks contigs to get closer to maxChunks. (Default = False)
--subdatasets Splits the XML file based on sub-datasets. (Default = False)
--outdir OUTDIR Specifies an output directory for the resulting XML files.
(Default = <in-place>, not the current working directory.)
dataset validate [-h] [--skipFiles] infile
summarize
Command: Summarize a Data Set XML file.
dataset summarize [-h] infile
consolidate
Command: Consolidate XML files.
dataset consolidate [-h] [--numFiles NUMFILES] [--noTmp] infile datafile xmlfile
loadstats
Command: Load an sts.xml file containing pipeline statistics into a Data Set XML file.
dataset loadstats [-h] [--outfile OUTFILE] infile statsfile
Required Description
infile The name of the XML file to validate.
Options Description
--skipFiles Skips validating external resources. (Default = False)
Required Description
infile The name of the XML file to summarize.
Required Description
infile The name of the XML file to consolidate.
datafile The name of the resulting data file.
xmlfile The name of the resulting XML file.
Options Description
--numFiles x Specifies the number of data files to produce. (Default = 1)
--noTmp Does not copy to a temporary location to ensure local disk use. (Default = False)
Required Description
infile The name of the Data Set XML file to modify.
statsfile The name of the .sts.xml file to load.
Options Description
--outfile OUTFILE The name of the XML file to output. (Default = None)
newuuid
Command: Refresh a Data Set's Unique ID.
dataset newuuid [-h] [--random] infile
loadmetadata
Command: Load a .metadata.xml file into a Data Set XML file.
dataset loadmetadata [-h] [--outfile OUTFILE] infile metadata
copyto
Command: Copy a Data Set and resources to a new location.
dataset copyto [-h] [--relative] infile outdir
absolutize
Command: Make the paths in an XML file absolute.
dataset absolutize [-h] [--outdir OUTDIR] infile
Required Description
infile The name of the XML file to refresh.
Options Description
--random Generates a random UUID, instead of a hash. (Default = False)
Required Description
infile The name of the Data Set XML file to modify.
metadata The .metadata.xml file to load, or Data Set to borrow from.
Options Description
--outfile OUTFILE Specifies the XML file to output. (Default = None)
Required Description
infile The name of the XML file to copy.
outdir The directory to copy to.
Options Description
--relative Makes the included paths relative instead of absolute. (Default = False)
Required Description
infile The name of the XML file whose paths should be absolute.
Options Description
--outdir OUTDIR Specifies an optional output directory. (Default = None)
relativize
Command: Make the paths in an XML file relative.
dataset relativize [-h] infile
Required Description
infile The name of the XML file whose paths should be relative.
Examples - Filter reads
To filter one or more BAM file’s worth of subreads, aligned or otherwise,
and then place them into a single BAM file:
# usage: dataset filter <in_fn.xml> <out_fn.xml> <filters>
dataset filter in_fn.subreadset.xml filtered_fn.subreadset.xml 'rq>0.85'
# usage: dataset consolidate <in_fn.xml> <out_data_fn.bam> <out_fn.xml>
dataset consolidate filtered_fn.subreadset.xml consolidate.subreads.bam out_fn.subreadset.xml
The filtered Data Set and the consolidated Data Set should be read-for-read equivalent when used with SMRT® Analysis software.
Example - Resequencing pipeline
• Align two movies' worth of subreads in two SubreadSets to a reference.
• Merge the subreads together.
• Split the subreads into Data Set chunks by contig.
• Process using gcpp on a chunkwise basis (in parallel).
1. Align each movie to the reference, producing a Data Set with one BAM
file for each execution:
pbalign movie1.subreadset.xml referenceset.xml movie1.alignmentset.xml
pbalign movie2.subreadset.xml referenceset.xml movie2.alignmentset.xml
2. Merge the files into a FOFN-like Data Set; BAMs are not touched:
# dataset merge <out_fn> <in_fn> [<in_fn> <in_fn> ...]
dataset merge merged.alignmentset.xml movie1.alignmentset.xml movie2.alignmentset.xml
3. Split the Data Set into chunks by contig name; BAMs are not touched:
Note:
• Supplying output files splits the Data Set into that many output files (up to the number of contigs), with multiple contigs per file.
• Not supplying output files splits the Data Set into one output file per contig, named automatically.
• Specifying a number of chunks instead produces that many files, with contig or even subcontig (reference window) splitting.
dataset split --contigs --chunks 8 merged.alignmentset.xml
4. Process the chunks:
gcpp --reference referenceset.xml --output chunk1consensus.fasta,chunk1consensus.fastq,chunk1consensus.vcf,chunk1consensus.gff chunk1contigs.alignmentset.xml
The chunking works by duplicating the original merged Data Set (no BAM
duplication) and adding filters to each duplicate such that only reads
belonging to the appropriate contigs are emitted. The contigs are
distributed among the output files in such a way that the total number of
records per chunk is about even.
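The chunks can then be processed in a loop. A hedged sketch, assuming the split produced files named chunk1contigs.alignmentset.xml through chunk8contigs.alignmentset.xml (adjust to the names dataset split actually emitted):
for i in 1 2 3 4 5 6 7 8; do
  gcpp --reference referenceset.xml \
    --output chunk${i}consensus.fasta,chunk${i}consensus.fastq,chunk${i}consensus.vcf,chunk${i}consensus.gff \
    chunk${i}contigs.alignmentset.xml &
done
wait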
Demultiplex Barcodes
The Demultiplex Barcodes application identifies barcode sequences in PacBio single-molecule sequencing data.
Demultiplex Barcodes can demultiplex samples that have a unique per-sample barcode pair and were pooled and sequenced on the same SMRT® Cell. There are four different methods for barcoding samples with PacBio technology:
1. Sequence-specific primers
2. Barcoded universal primers
3. Barcoded adapters
4. Linear Barcoded Adapters for Probe-based Captures
In addition, there are three different barcode library designs. Demultiplex Barcodes supports demultiplexing of both raw subreads and CCS reads.
In the overview above, the input sequence is flanked by adapters on both sides. The bases adjacent to an adapter are barcode regions. A read can have up to two barcode regions, leading and trailing. Either or both adapters can be missing, in which case the leading and/or trailing region cannot be identified.
For symmetric and tailed library designs, the same barcode is attached to
both sides of the insert sequence of interest. The only difference is the
orientation of the trailing barcode. For barcode identification, one read
with a single barcode region is sufficient.
For the asymmetric design, different barcodes are attached to the sides of
the insert sequence of interest. To identify the different barcodes, a read
with leading and trailing barcode regions is required.
Output barcode pairs are generated from the identified barcodes. The barcode names are combined using "--", for example bc1002--bc1054. The sort order is defined by the barcode indices, starting with the lowest.
Workflow
By default, Demultiplex Barcodes processes input reads grouped by ZMW,
except if the
--per-read option is used. All barcode regions along the
read are processed individually. The final per-ZMW result is a summary
over all barcode regions. Each ZMW is assigned to a pair of selected
barcodes from the provided set of candidate barcodes. Subreads from the
same ZMW will have the same barcode and barcode quality. For a particular target barcode region, every barcode sequence gets aligned as given and as reverse-complement, and the higher-scoring orientation is chosen. This results in a list of scores over all candidate barcodes.
• If only same barcode pairs are of interest (symmetric/tailed), use the --same option to filter out different barcode pairs.
• If only different barcode pairs are of interest (asymmetric), use the --different option to require at least two barcodes to be read, and remove pairs with the same barcode.
Parameter presets
Recommended parameter combinations are available using --preset for HiFi input:
• HIFI-SYMMETRIC: --ccs --min-score 80 --min-end-score 50 --min-ref-span 0.75 --same
• HIFI-ASYMMETRIC: --ccs --min-score 80 --min-end-score 50 --min-ref-span 0.75 --different --min-scoring-regions 2
• NONE (Default)
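For example, a symmetric HiFi run using a preset might look like this (a hedged sketch; file names are placeholders):
lima --preset HIFI-SYMMETRIC movie.consensusreadset.xml barcodes.barcodeset.xml demux.consensusreadset.xml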
Half adapters
For an adapter call with only one barcode region, the high-quality region
finder cuts right through the adapter. The preceding or succeeding
subread was too short and was removed, or the sequencing reaction
started/stopped there. This is called a half adapter. Thus, there are also
1.5, 2.5, N+0.5 adapter calls.
ZMWs with half or only one adapter can be used to identify the same
barcode pairs; positive-predictive value might be reduced compared to
high adapter calls. For asymmetric designs with different barcodes in a
pair, at least a single full-pass read is required; this can be two adapters,
two half adapters, or a combination.
Usage:
Any existing output files are overwritten after execution.
• Always use --peek-guess to remove spurious barcode hits.
Analysis of subread data:
lima movie.subreads.bam barcodes.fasta prefix.bam
lima movie.subreadset.xml barcodes.barcodeset.xml prefix.subreadset.xml
Analysis of CCS reads:
lima --ccs movie.ccs.bam barcodes.fasta prefix.bam
lima --ccs movie.consensusreadset.xml barcodes.barcodeset.xml
prefix.consensusreadset.xml
If you do not need to import the demultiplexed data into SMRT Link, use
the
--no-pbi option to minimize memory consumption and run time.
Symmetric or tailed options:
Raw: --same
CCS reads: --same --ccs
Asymmetric options:
Raw: --different
CCS reads: --different --ccs
Example execution:
lima m54317_180718_075644.subreadset.xml \
Sequel_RSII_384_barcodes_v1.barcodeset.xml \
m54317_180718_075644.demux.subreadset.xml \
--different --peek-guess
Options Description
--same Retains only reads with the same barcodes on both ends of the insert
sequence, such as symmetric and tailed designs.
--different Retains only reads with different barcodes on both ends of the insert
sequence, asymmetric designs. Enforces --min-passes 1.
--min-length n Omits reads with lengths below n base pairs after demultiplexing. ZMWs
with no reads passing are omitted. (Default = 50)
--max-input-length n Omits reads with lengths above n base pairs for scoring in the
demultiplexing step. (Default = 0, deactivated)
--min-score n Omits ZMWs with average barcode scores below n. A barcode score measures the alignment between a barcode attached to a read and an ideal barcode sequence, and is an indicator of how well the chosen barcode pair matches. It is normalized to a range between 0 (no hit) and 100 (a perfect match). (Default = 0; PacBio recommends setting it to 26.)
--min-end-score n Specifies the minimum end barcode score threshold applied to the
individual leading and trailing ends. (Default = 0)
--min-passes n Omits ZMWs with fewer than n full passes (a read with a leading and trailing adapter). (Default = 0, no full pass needed) Example:
0 passes: insert - adapter - insert
1 pass : insert - adapter - INSERT - adapter - insert
2 passes: insert - adapter - INSERT - adapter - INSERT - adapter - insert
--score-full-pass Uses only reads flanked by adapters on both sides (full-pass reads) for
barcode identification.
--min-ref-span Specifies the minimum reference span relative to the barcode length.
(Default = 0.5)
--per-read Scores and tags per subread, instead of per ZMW.
--ccs Sets defaults to -A 1 -B 4 -D 3 -I 3 -X 1.
--peek n Looks at the first n ZMWs of the input and returns the mean barcode score. This lets you test multiple barcode.fasta files and see which set of barcodes was used.
--guess n This performs demultiplexing twice. In the first iteration, all barcodes are
tested per ZMW. Afterwards, the barcode occurrences are counted and
their mean is tested against the threshold n; only those barcode pairs that
pass this threshold are used in the second iteration to produce the final
demultiplexed output. A prefix.lima.guess file shows the decision process; --same is respected.
--guess-min-count Specifies the minimum ZMW count to whitelist a barcode. This filter is
ANDed with the minimum barcode score specified by --guess.
(Default = 0)
--peek-guess Sets the following options: --peek 50000 --guess 45 --guess-min-count 10.
Demultiplex Barcodes will run twice on the input data. For the first 50,000 ZMWs, it will guess the barcodes and store the mask of identified barcodes. In the second run, the barcode mask is used to demultiplex all ZMWs.
If combined with --ccs, then the barcode score threshold is increased by --guess 75.
--single-side Identifies barcodes in molecules that only have barcodes adjacent to one adapter.
--window-size-mult, --window-size-bp The candidate region size multiplier: barcode_length * multiplier. (Default = 3) Optionally, you can specify the region size in base pairs using --window-size-bp. If set, --window-size-mult is ignored.
--num-threads n Spawns n threads; 0 means use all available cores. This option also controls the number of threads used for BAM and PBI compression. (Default = 0)
--chunk-size n Specifies that each thread consumes n ZMWs per chunk for processing. (Default = 10)
--no-bam Does not produce BAM output. Useful if only reports are of interest, as run time is shorter.
--no-pbi Does not produce a .bam.pbi index file. The on-the-fly .bam.pbi file generation buffers the output data. If you do not need a .bam.pbi index file for SMRT Link import, use this option to decrease memory usage to a minimum and shorten the run time.
--no-reports Does not produce any reports. Useful if only demultiplexed BAM files are needed.
--dump-clips Outputs all clipped barcode regions generated to the <prefix>.lima.clips file.
--dump-removed Outputs all records that did not pass the specified thresholds, or are without barcodes, to the <prefix>.lima.removed.bam file.
--split-bam, --split-bam-named Specifies that each barcode has its own BAM file called prefix.idxBest-idxCombined.bam, such as prefix.0-0.bam. Optionally, --split-bam-named names the files by their barcode names instead of their barcode indices.
--isoseq Removes primers as part of the Iso-Seq® pipeline. See "Demultiplexing Iso-Seq® data" on page 32 for details.
--bad-adapter-ratio n Specifies the maximum ratio of bad adapters. (Default = 0)
Input files
Input data in PacBio-enhanced BAM format is either:
• Sequence data - Unaligned subreads, directly from Sequel II systems and Sequel IIe systems.
• Unaligned CCS reads, generated by CCS analysis.
Barcodes are provided as a FASTA file or BarcodeSet file:
• One entry per barcode sequence.
• No duplicate sequences.
• All bases must be in upper-case.
• Orientation-agnostic (forward or reverse-complement, but not reversed).
Example:
>bc1000
CTCTACTTACTTACTG
>bc1001
GTCGTATCATCATGTA
>bc1002
AATATACCTATCATTA
Note: Name barcodes using an alphabetic character prefix to avoid later
barcode name/index confusion.
Output files
Demultiplex Barcodes generates multiple output files by default, all
starting with the same prefix as the output file, using the suffixes
.bam,
.subreadset.xml, and .consensusreadset.xml. The report prefix is
lima. Example:
lima m54007_170702_064558.subreads.bam barcode.fasta /my/path/m54007_170702_064558_demux.subreadset.xml
For all output files, the prefix is
/my/path/m54007_170702_064558_demux.
• <prefix>.bam: Contains clipped records, annotated with barcode tags, that passed filters and respect the --same option.
• <prefix>.lima.report: A tab-separated file describing each ZMW, unfiltered. This is useful information for investigating the demultiplexing process and the underlying data. A single row contains all reads from a single ZMW. For --per-read, each row contains one subread, and ZMWs might span multiple rows.
• <prefix>.lima.summary: Lists how many ZMWs were filtered, how many ZMWs are the same or different, and how many reads were filtered.
(1)
ZMWs input (A): 213120
ZMWs above all thresholds (B): 176356 (83%)
ZMWs below any threshold (C): 36764 (17%)
(2)
ZMW Marginals for (C):
Below min length : 26 (0%)
Below min score : 0 (0%)
Below min end score : 5138 (13%)
Below min passes : 0 (0%)
Below min score lead : 11656 (32%)
Below min ref span : 3124 (8%)
Without adapter : 25094 (68%)
With bad adapter : 10349 (28%) <- Only with --bad-adapter-ratio
Undesired hybrids : xxx (xx%) <- Only with --peek-guess
Undesired same barcode pairs : xxx (xx%) <- Only with --different
Undesired diff barcode pairs : xxx (xx%) <- Only with --same
Undesired 5p--5p pairs : xxx (xx%) <- Only with --isoseq
Undesired 3p--3p pairs : xxx (xx%) <- Only with --isoseq
Undesired single side : xxx (xx%) <- Only with --isoseq
Undesired no hit : xxx (xx%) <- Only with --isoseq
(3)
ZMWs for (B):
With same barcode : 162244 (92%)
With different barcodes : 14112 (8%)
Coefficient of correlation : 32.79%
(4)
ZMWs for (A):
Allow diff barcode pair : 157264 (74%)
Allow same barcode pair : 188026 (88%)
Bad adapter yield loss : 10112 (5%) <- Only with --bad-adapter-ratio
Bad adapter impurity : 10348 (5%) <- Only without --bad-adapter-ratio
(5)
Reads for (B):
Above length : 1278461 (100%)
Below length : 2787 (0%)
Explanation of each block:
1. Number of ZMWs that went into lima, how many ZMWs were passed
to the output file, and how many did not qualify.
2. For those ZMWs that did not qualify: The marginal counts of each
filter. (Filters are described in the Options table.)
When running with --peek-guess or a similar manual option combination, and different barcode pairs are found during peek, the full SMRT Cell may contain low-abundance different barcode pairs that were identified during peek individually, but not as a pair. Those unwanted barcode pairs are called hybrids.
3. For those ZMWs that passed: How many were flagged as having the
same or different barcode pair, as well as the coefficient of variation
for the barcode ZMW yield distribution in percent.
4. For all input ZMWs: How many allow calling the same or different barcode pair. This is a simplified version of how many ZMWs have at least one full pass to allow a different barcode pair call and how many ZMWs have at least half an adapter, allowing the same barcode pair call.
5. For those ZMWs that qualified: The number of reads that are above
and below the specified
--min-length threshold.
• <prefix>.lima.counts: A .tsv file listing the counts of each observed barcode pair. Only passing ZMWs are counted. Example: column -t prefix.lima.counts
IdxFirst IdxCombined IdxFirstNamed IdxCombinedNamed Counts MeanScore
0 0 bc1001 bc1001 1145 68
1 1 bc1002 bc1002 974 69
2 2 bc1003 bc1003 1087 68
• <prefix>.lima.clips: Contains clipped barcode regions generated using the --dump-clips option. Example:
head -n 6 prefix.lima.clips
>m54007_170702_064558/4850602/6488_6512 bq:34 bc:11
CATGTCCCCTCAGTTAAGTTACAA
>m54007_170702_064558/4850602/6582_6605 bq:37 bc:11
TTTTGACTAACTGATACCAATAG
>m54007_170702_064558/4916040/4801_4816 bq:93 bc:10
• <prefix>.lima.removed.bam: Contains records that did not pass the specified thresholds, or are without barcodes, generated using the --dump-removed option. lima does not generate a .pbi or Data Set for this file. This option cannot be used with any splitting option.
• <prefix>.lima.guess: A .tsv file that describes the barcode subsetting process activated using the --peek and --guess options. Example:
IdxFirst IdxCombined IdxFirstNamed IdxCombinedNamed NumZMWs MeanScore Picked
0 0 bc1001t bc1001t 1008 50 1
1 1 bc1002t bc1002t 1005 60 1
2 2 bc1003t bc1003t 5 24 0
3 3 bc1004t bc1004t 555 61 1
• One Data Set file (.subreadset.xml or .consensusreadset.xml) is generated per output BAM file.
• .pbi: One PBI file is generated per output BAM file.
What is a universal spacer sequence and how does it affect
demultiplexing?
For library designs that include an identical sequence between adapter and barcode, such as probe-based linear barcoded adapter samples, Demultiplex Barcodes offers a special mode that is activated if it finds a shared prefix sequence among all provided barcode sequences.
Example:
>custombc1
ACATGACTGTGACTATCTCACACATATCAGAGTGCG
>custombc2
ACATGACTGTGACTATCTCAACACACAGACTGTGAG
In this case, Demultiplex Barcodes detects the shared prefix
ACATGACTGTGACTATCTCA
and removes it internally from all barcodes.
Subsequently, it increases the window size by the length L of the prefix sequence.
• If --window-size-bp N is used, the actual window size is L + N.
• If --window-size-mult M is used, the actual window size is (L + |bc|) * M.
Because the alignment is semi-global, a leading reference gap can be
added without any penalty to the barcode score.
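As a worked example using the primers above: the shared prefix is 20 bp (L = 20); if |bc| denotes the barcode length after prefix removal (16 bp for these sequences), then with the default --window-size-mult of 3 the actual window size is (20 + 16) * 3 = 108 bp.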
What are bad adapters?
In the subreads.bam file, each subread has a context flag cx. The flag specifies, among other things, whether a subread has flanking adapters, before and/or after. Adapter-finding was improved and can also find molecularly-missing adapters, or those obscured by a local decrease in accuracy. This may lead to missing or obscured bases in the flanking barcode. Such adapters are labelled "bad", as they don't align with the adapter reference sequence(s). Regions flanking those bad adapters are problematic, because they can fully or partially miss the barcode bases, leading to wrong classification of the molecule. lima can handle those adapters by ignoring regions flanking bad adapters. For this, lima computes the ratio of the number of bad adapters to the number of all adapters.
By default, --bad-adapter-ratio is set to 0 and does not perform any filtering. In this mode, bad adapters are handled just like good adapters. But the *.lima.summary file contains one row with the number of ZMWs that have at least 25% bad adapters, but otherwise pass all other filters. This metric can be used as a diagnostic to assess library preparation.
If --bad-adapter-ratio is set to a positive value in (0,1), barcode regions flanking bad adapters are treated as missing. If a ZMW has a higher ratio of bad adapters than provided, the ZMW is filtered and consequently removed from the output. The *.lima.summary file contains two additional rows:
With bad adapter : 10349 (28%)
Bad adapter yield loss : 10112 (5%)
The first row counts the number of ZMWs that have bad adapter ratios that are too high; the percentage is with respect to the number of all ZMWs not passing. The second row counts the number of ZMWs that are removed solely due to bad adapter ratios that are too high; the percentage is with respect to the number of all input ZMWs and consequently is the effective yield loss caused by bad adapters.
Page 32
If a ZMW has ~50% bad adapters, one side of the molecule is molecularly missing an adapter. At 100% bad adapters, both sides are missing adapters. A percentage lower than ~40% indicates decreased local accuracy during sequencing, leading to adapter sequences not being found. If a high percentage of ZMWs is molecularly missing adapters, you should improve library preparation.
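For example, a hypothetical invocation that removes ZMWs in which more than 30% of adapters are flagged as bad (file names are placeholders; --same selects same-barcode pairs, as for symmetric designs):
lima movie.subreads.bam barcodes.fasta demux.bam --same --bad-adapter-ratio 0.3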
Demultiplexing Iso-Seq® data
Demultiplex Barcodes is used to identify and remove Iso-Seq cDNA primers. If the Iso-Seq sample is barcoded, the barcodes should be included as part of the primer. Demultiplexing Iso-Seq data with lima and the --isoseq option is available only from the command line.
The input Iso-Seq data format for demultiplexing is .ccs.bam. Users must first generate a CCS reads BAM file for an Iso-Seq Data Set before running lima. The recommended parameters for running CCS analysis for Iso-Seq are minimum passes = 1, minimum predicted accuracy = 0.9, and turning Polish to OFF.
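On the command line this corresponds roughly to the following sketch; the flag names assume a recent pbccs build, where the Polish setting is controlled in the SMRT Link application rather than by a dedicated ccs flag (file names are placeholders):
ccs movie.subreads.bam movie.ccs.bam --min-passes 1 --min-rq 0.9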
1. Primer IDs must be specified using the suffix _5p to indicate 5’ cDNA primers and the suffix _3p to indicate 3’ cDNA primers. The 3’ cDNA primer should not include the Ts and is written in reverse complement.
2. Below are four example primer sets. The first is unbarcoded, the
second has barcodes (shown in lower case) adjacent to the 3’ primer.
Example 1: The Iso-Seq cDNA Primer primer set, included with the SMRT
Link installation.
Users following the standard Iso-Seq Express protocol without multiplexing, or running a Data Set that has already been demultiplexed (either using Run Design or the SMRT® Analysis application), should use this default option.
>IsoSeq_5p
GCAATGAAGTCGCAGGGTTGGG
>IsoSeq_3p
GTACTCTGCGTTGATACCACTGCTT
Example 2: The Iso-Seq 12 Barcoded cDNA Primers set, included with the
SMRT Link installation.
Users using barcoded cDNA primers listed in the Appendix 3 - Recommended barcoded NEBNext single cell cDNA PCR primer and Iso-Seq Express cDNA PCR primer sequences section of the document Procedure & checklist - Preparing Iso-Seq® libraries using SMRTbell Prep Kit 3.0 should select this option.
>bc1001_5p
CACATATCAGAGTGCGGCAATGAAGTCGCAGGGTTGGGG
>bc1002_5p
ACACACAGACTGTGAGGCAATGAAGTCGCAGGGTTGGGG
(There are a total of 24 sequence records, representing 12 pairs of F/R
barcoded cDNA primers.)
Example 3: An example of a custom cDNA primer set. 4 tissues were
multiplexed using barcodes on the 3’ end only.
>IsoSeq_5p
GCAATGAAGTCGCAGGGTTGGG
>dT_BC1001_3p
AAGCAGTGGTATCAACGCAGAGTACCACATATCAGAGTGCG
>dT_BC1002_3p
AAGCAGTGGTATCAACGCAGAGTACACACACAGACTGTGAG
>dT_BC1003_3p
AAGCAGTGGTATCAACGCAGAGTACACACATCTCGTGAGAG
>dT_BC1004_3p
AAGCAGTGGTATCAACGCAGAGTACCACGCACACACGCGCG
Example 4: Special Handling for the TeloPrime cDNA Kit
The Lexogen TeloPrime cDNA kit contains As in the 3’ primer that cannot
be differentiated from the polyA tail. For best results, remove the As from
the 3’ end as shown below:
>TeloPrimeModified_5p
TGGATTGATATGTAATACGACTCACTATAG
>TeloPrimeModified_3p
CGCCTGAGA
3. Use the --isoseq mode. Note that this cannot be combined with the --guess option.
4. The output will be only different pairs with a 5p and 3p combination:
demux.5p--tissue1_3p.bam
demux.5p--tissue2_3p.bam
The --isoseq parameter set is very conservative in removing spurious and ambiguous calls, and guarantees that only proper asymmetric (barcoded) primer pairs are used in downstream analyses. Good libraries reach >75% of CCS reads passing the Demultiplex Barcodes filters.
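As an illustrative sketch using the Example 3 primer set above (file names are placeholders):
lima movie.ccs.bam custom_primers.fasta demux.bam --isoseq
With those primers, one output BAM per 5p/3p combination would be expected, such as demux.IsoSeq_5p--dT_BC1001_3p.bam through demux.IsoSeq_5p--dT_BC1004_3p.bam.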
BAM tags
In SMRT Link v11.0, LB and SM tags are set by the user in Run Design. The
SM tag can also be set in Demultiplex Barcodes in SMRT Analysis.
Page 34
Non-demultiplex case:
LB: Well Sample Name.
SM: Bio Sample Name.
Multiplexed case, BAM pre-demultiplexing:
LB: Well Sample Name.
SM: Tag removed.
Multiplexed case, BAMs post-demultiplexing:
LB: Well Sample Name for all child barcode BAMs.
SM: Each individual Bio Sample Name for the specific barcode.
BC: Barcode sequence or hyphenated barcode sequences of the pair.
DS: Appends barcode information used in demultiplexing: BarcodeFile,
BarcodeHash, BarcodeCount, BarcodeMode, BarcodeQuality.
Example read group header after demultiplexing:
@RG
ID:66d5a6af/3--3
PL:PACBIO
DS:READTYPE=SUBREAD;
Ipd:CodecV1=ip;
PulseWidth:CodecV1=pw;
BINDINGKIT=101-500-400;
SEQUENCINGKIT=101-427-800;
BASECALLERVERSION=5.0.0;
FRAMERATEHZ=100.000000;
BarcodeFile=Sequel_16_barcodes_v3.barcodeset.xml;
BarcodeHash=f2b1fa0b43eb6ccbb30749883bb550e3;
BarcodeCount=16;
BarcodeMode=Symmetric;
BarcodeQuality=Score
PU:m54010_200212_162236
SM:MySampleName
PM:SEQUEL
BC:ACAGTCGAGCGCTGCGT
Page 35
export-datasets
The export-datasets tool takes one or more PacBio Data Set XML files and packages all contents (including index files and supplemental Data Sets) into a single ZIP archive. Data Set resources, such as BAM files, are reorganized and renamed to flatten the directory structure, avoid redundant file writes, and convert all resource paths from absolute to relative paths. Where multiple Data Sets are provided, the contents of each are nested in a directory named after the UniqueId attribute in the XML.
The resulting archive is primarily intended to be directly imported into
SMRT Link using the Data Management interface, but it may also be
unpacked manually and used on the command line.
Usage
export-datasets [options] <dataset>...
Input files
•One or more PacBio Data Set XML files.
Output file
•One output ZIP file.
Examples
export-datasets m64001_200704_012345.subreadset.xml
export-datasets sample1.consensusreadset.xml sample2.consensusreadset.xml \
  sample3.consensusreadset.xml -o barcoded_ccs.zip
export-datasets /opt/smrtlink/jobs/0000/0000001/0000001234/outputs/mapped.alignmentset.xml
Options Description
-o, --output Name of output ZIP file. (Default = datasets_<timestamp>.zip)
--keep-parent-ref Keeps the reference to the parent Data Set when archiving a
demultiplexed child Data Set.
--no-scraps Excludes the scraps.bam file if present in the XML file.
-h, --help Displays help information and exits.
--log-file Writes the log to a file. (Default = stderr)
--log-level Specifies the log level; values are [ERROR, DEBUG, INFO, WARN].
(Default = WARN)
--logback Overrides all logger configuration using a specified logback.xml file.
--log2stdout If True, log output is displayed to the console. (Default = False)
--debug Alias for setting the log level to DEBUG. (Default = False)
--quiet Alias for setting the log level to ERROR. (Default = False)
--verbose Alias for setting the log level to INFO. (Default = False)
Page 36
export-job
The export-job tool packages a SMRT Link Analysis job for export to
another system, usually for reimportation into another SMRT Link
instance. All internal paths in job output files are converted from absolute
to relative paths, and many of the internal details of the Cromwell
workflows are omitted. The export is not a complete record of the job, but
rather a collection of job output files and metadata.
Note that
export-job will include any external Data Sets referenced in
output Data Sets inside the job, for example ReferenceSets associated
with mapped Data Sets, or BarcodeSets associated with demultiplexed
Data Sets. However, these Data Sets will not be imported along with the
job. The exported job does not include the input reads used to run the job;
these may be exported separately using the
export-datasets tool.
Important: Only SMRT Link v10.0 or later generates the necessary
metadata files for
export-job to save a full record of job execution. Jobs
created with older versions of SMRT Link will still be archived, but the
metadata will be empty and/or incorrect.
Usage
export-job [options] <job_dir>
Input
A path to a job directory.
Output file
•One output ZIP file.
Examples
export-job /path/to/smrtlink/jobs-root/0000/0000000/0000000860 -o job860.zip
Options Description
<job_dir> Path to a SMRT Link job directory.
-o, --output Name of output ZIP file. (Default = job_<timestamp>.zip)
-h, --help Displays help information and exits.
--log-file Writes the log to a file. (Default = stderr)
--log-level Specifies the log level; values are [ERROR, DEBUG, INFO, WARN].
(Default = WARN)
--logback Overrides all logger configuration using a specified logback.xml file.
--log2stdout If True, log output is displayed to the console. (Default = False)
--debug Alias for setting the log level to DEBUG. (Default = False)
--quiet Alias for setting the log level to ERROR. (Default = False)
--verbose Alias for setting the log level to INFO. (Default = False)
Page 37
To reimport on another system:
pbservice import-job job860.zip
Page 38
gcpp
gcpp is a variant-calling tool from the GCpp package; it offers several variant-calling algorithms for PacBio sequencing data.
Usage
gcpp -j8 --algorithm=arrow \
-r lambdaNEB.fa \
-o variants.gff \
aligned_subreads.bam
This example requests variant-calling, using 8 worker processes and the
Arrow algorithm, taking input from the file
aligned_subreads.bam, using
the FASTA file
lambdaNEB.fa as the reference, and writing output to
variants.gff.
A particularly useful option is --referenceWindow/-w, which allows variant calling to be performed exclusively on a window of the reference genome.
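For instance, a sketch restricting calling to a single 10-kb window; the contig name is a placeholder, and the refName:start-end window syntax is an assumption carried over from GenomicConsensus-style tools:
gcpp --algorithm=arrow -r lambdaNEB.fa -w "lambda_NEB3011:0-10000" -o variants.gff aligned_subreads.bam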
Input files
•A sorted file of reference-aligned reads in PacBio’s standard BAM format.
•A FASTA file that follows the PacBio FASTA file convention. If specifying an input FASTA file, a FASTA index file (.fai) with the same name and path is required. If the .fai file is not supplied, gcpp exits and displays an error message.
Note: The --algorithm=arrow option requires that certain metrics be present in the input BAM file: per-read SNR metrics, and the per-base PulseWidth metric for Sequel data. The selected algorithm stops with an error message if any features that it requires are unavailable.
Output files
Output files are specified as comma-separated arguments to the -o flag. The file name extension provided to the -o flag is meaningful, as it determines the output file format. For example:
gcpp aligned_subreads.bam -r lambda.fa -o myVariants.gff,myConsensus.fasta
reads input from aligned_subreads.bam, using the reference lambda.fa, and sends variant-call output to the file myVariants.gff and consensus output to myConsensus.fasta.
The file formats currently supported (selected by extension) are:
•.gff: PacBio GFFv3 variants format; convertible to BED.
•.vcf: VCF 4.2 variants format (compatible with v4.3).
•.fasta: FASTA file recording the consensus sequence calculated for each reference contig.
•.fastq: FASTQ file recording the consensus sequence calculated for each reference contig, as well as per-base confidence scores.
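For example, a sketch producing both VCF variant calls and a FASTQ consensus in a single run (file names are placeholders):
gcpp aligned_subreads.bam -r lambda.fa -o variants.vcf,consensus.fastq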
Available algorithms
At this time there are three algorithms available for variant calling: plurality, poa, and arrow.
•plurality is a simple and very fast procedure that merely tallies the most frequent read base or bases found in alignment with each reference base, and reports deviations from the reference as potential variants. This approach is prone to insertion and deletion errors.
•poa uses the partial order alignment algorithm to determine the consensus sequence. It is a heuristic algorithm that approximates a multiple sequence alignment by progressively aligning sequences to an existing set of alignments.
•arrow uses the per-read SNR metric and the per-pulse pulsewidth metric as part of its likelihood model.
Confidence values
The arrow and plurality algorithms make a confidence metric available for every position of the consensus sequence. The confidence should be interpreted as a phred-transformed posterior probability that the consensus call is incorrect:
QV = -10 * log10(p_error)
gcpp clips reported QV values at 93; larger values cannot be encoded in a standard FASTQ file.
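As a worked example, a consensus call with posterior error probability p = 1e-6 is reported as QV = -10 * log10(1e-6) = 60; any error probability below about 5e-10 would exceed QV 93 and is therefore clipped.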
Options Description
-j Specifies the number of worker processes to use.
--algorithm= Specifies the variant-calling algorithm to use; values are plurality,
arrow and poa. (Default = arrow)
-r Specifies the FASTA reference file to use.
-o Specifies the output file name(s); the format is determined by the extension. Supported extensions are .gff, .vcf, .fasta, and .fastq.
--maskRadius When using the arrow algorithm, setting this option to a value N greater
than 0 causes gcpp to pass over the data a second time after masking out
regions of reads that have >70% errors in 2*N+1 bases. This setting has
little to no effect at low coverage, but for high-coverage datasets (>50X),
setting this parameter to 3 may improve final consensus accuracy. In rare
circumstances, such as misassembly or mapping to the wrong reference,
enabling this parameter may cause worse performance.
--minConfidence MINCONFIDENCE
-q MINCONFIDENCE
Specifies the minimum confidence for a variant call to be output to
variants.{gff,vcf} (Default = 40)
--minCoverage MINCOVERAGE
-x MINCOVERAGE
Specifies the minimum site coverage for variant calls and consensus to be
calculated for a site. (Default = 5)
Page 40
Chemistry specificity
The --algorithm=arrow parameter is trained per-chemistry. arrow identifies the sequencing chemistry used for each run by looking at metadata contained in the input BAM data file. This behavior can be overridden by a command-line option.
When multiple chemistries are represented in the reads in the input file, Arrow models each read using the parameter set for its chemistry, yielding optimal results.
Page 41
Genome Assembly
The Genome Assembly application generates de novo assemblies using
HiFi reads. The application is fast, produces contiguous assemblies, and
is suitable for genomes of any size.
The Genome Assembly application is powered by the IPA HiFi genome assembler and includes the following features:
•Separates haplotypes during assembly using a novel phasing stage (Nighthawk).
•Polishes the contigs with phased reads using Racon.
•Improves haplotype separation using the purge_dups tool.
Workflow of the Genome Assembly application
Analysis steps are highly optimized to produce assemblies of large
genomes efficiently.
The workflow consists of seven stages:
1. Sequence database construction.
2. Fast overlap computation using the Pancake tool.
3. A dedicated phasing stage using the Nighthawk tool.
4. Filtering chimeras and residual repeats.
5. Layout based on the string graph.
6. Polishing using the Racon tool.
7. Purging haplotype duplicates from the primary assembly using the third-party tool purge_dups.
The workflow accepts HiFi XML Data Sets as input.
IPA HiFi genome assembler
•Scales well on a cluster.
•The workflow has an embedded downsampling feature: If the genome size and the desired coverage are specified, the initial stage (sequence database construction) downsamples the input Data Set to the desired coverage. Otherwise, the full coverage is used.
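As an illustrative calculation: with a genome size of 3,000,000,000 (3 Gbp) and a desired coverage of 30, the sequence database construction stage would retain approximately 3 Gbp × 30 = 90 Gbp of reads. The numbers are made up for illustration.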
Usage
The Genome Assembly application is run using the pbcromwell run
command, with the
pb_assembly_hifi parameter to specify the
application. See “pbcromwell” on page 75 for details.
Page 42
To view information on the available Genome Assembly options, enter:
pbcromwell show-workflow-details pb_assembly_hifi
The minimum command needed to run the workflow requires the input
and the number of threads. The following example uses 16 threads:
pbcromwell run pb_assembly_hifi -e <input.xml> --nproc 16
The following example performs an assembly using an input XML Data
Set, and uses all default settings, including 1 CPU:
pbcromwell run pb_assembly_hifi -e <input.consensusreadset.xml>
Note: The default options for this workflow are equivalent to the following
command:
pbcromwell run pb_assembly_hifi \
-e <input.consensusreadset.xml> \
--task-option reads=None \
--task-option ipa2_genome_size=0 \
--task-option ipa2_downsampled_coverage=0 \
--task-option ipa2_advanced_options="" \
--task-option ipa2_run_polishing=True \
--task-option ipa2_run_phasing=True \
--task-option ipa2_run_purge_dups=True \
--task-option ipa2_ctg_prefix="ctg." \
--task-option ipa2_reads_db_prefix="reads" \
--task-option ipa2_cleanup_intermediate_files=True \
--task-option dataset_filters="" \
--task-option filter_min_qv=20 \
--nproc 8
The default options for this workflow should work well for any genome type.
If the assembly is run on a single local node with a high CPU count, such as 64 cores, we recommend configuring the pbcromwell job submission to use 4 concurrent jobs and 16 threads per job. We found this to be more efficient than using 64 threads and 1 concurrent job, as many steps are heavily data I/O-dependent.
You can apply a similar principle to compute environments with more or fewer cores. For example, on a machine with 80 cores, one can use 20 threads and 4 concurrent jobs.
Page 43
Genome Assembly parameters
Option Default value Description
-e, --eid_ccs NONE Optional parameter, required if --task-option reads
<input> is not specified. This is a SMRT Link-specific input
parameter and supports only PacBio Consensusreadset
XML files as input.
--task-option reads NONE Optional parameter, required if -e <input> is not specified.
Supports multiple input formats: FASTA, FASTQ, BAM, XML,
FOFN and gzipped versions of FASTA/FASTQ.
--task-option
ipa2_genome_size
0 The approximate number of base pairs expected in the
genome. This is used only for downsampling; if the value is
0, downsampling is disabled. Note: It is better to slightly
overestimate rather than underestimate the genome length to
ensure good coverage across the genome.
--task-option
ipa2_downsampled_coverage
0 The input Data Set can be downsampled to a desired
coverage, provided that both the
ipa2_downsampled_coverage and ipa2_genome_size
options are specified and >0.
Downsampling applies to the entire assembly process,
including polishing.
This parameter selects reads randomly, using a fixed random
seed for reproducibility.
--task-option
ipa2_advanced_options
NONE A semicolon-separated list of KEY=VALUE pairs. New line
characters are not accepted. (These are described later in
this document.)
--task-option
ipa2_run_polishing
TRUE Enables or disables the polishing stage of the workflow.
Polishing can be disabled to perform fast draft assemblies.
--task-option
ipa2_run_phasing
TRUE Enables or disables the phasing stage of the workflow.
Phasing can be disabled to assemble haploid genomes, or to
perform fast draft assemblies.
--task-option
ipa2_run_purge_dups
TRUE Enables or disables identification of “duplicate” alternate
haplotype contigs which may be assembled in the primary
contig file, and moves them to the associate contig (haplotig)
file.
--task-option
ipa2_ctg_prefix
ctg. The prefix used to label the output generated contigs.
--task-option
ipa2_reads_db_prefix
reads The prefix of the sequence and seed databases which will be
used internally for assembly.
--task-option
ipa2_cleanup_intermediate_files
TRUE Removes intermediate files from the run directory to save space.
--task-option
dataset_filters
NONE (General pbcromwell option) A semicolon-separated (not comma-separated) list of other filters to add to the Data Set.
--task-option
filter_min_qv
20 (General pbcromwell option) Phred-scale integer QV cutoff
for filtering HiFi reads. The default for all applications is 20
(QV 20), or 99% predicted accuracy.
--task-option
downsample_factor
0 Downsampling factor applied directly to the input Data Set.
This parameter is not related to
ipa2_downsampled_coverage.
--task-option
mem_scale_factor
8 Controls the amount of requested memory for individual Cromwell tasks in the workflow. The default value of 8 is good for larger genomes, but it may be too much memory for smaller genomes.
--config NONE (General pbcromwell option) Java configuration file for running Cromwell.
--nproc 1 (General pbcromwell option) Number of processors per task.
Page 44
Input files
•*.bam file containing PacBio data.
•*.fasta or *.fastq file containing PacBio data.
•*.xml file containing PacBio data.
•*.fofn file listing the names of files containing PacBio data.
Output files
•final_purged_primary.fasta file containing assembled primary contigs.
•final_purged_haplotigs.fasta file containing assembled haplotigs.
Advanced parameters
Advanced parameters should be rarely modified. For the special cases
when that is required, advanced parameters are documented below.
Advanced parameters specified on the command line:
•Are in the form of key = value pairs.
•Each pair is separated by a semicolon (;) character.
•The full set of advanced parameters is surrounded by one set of double quotes.
•The specified value of a parameter overwrites the default options for that key. All desired options of that parameter must be explicitly listed, not just the ones which should change from the default.
•Setting an empty value clears the parameter; it does not reset the value back to default.
Example:
--task-option ipa2_advanced_options="config_seeddb_opt=-k 28;config_block_size=2048"
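For instance, to clear the phasing options while leaving other advanced parameters at their defaults (a sketch using the parameter names from the table below):
--task-option ipa2_advanced_options="config_phasing_opt="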
Page 45
Complete list of available advanced parameters and default values:
Advanced parameters Default value Description
config_genome_size 0 The approximate number of base pairs expected in the
genome, used to determine the coverage cutoff. This is
only used for downsampling; 0 turns downsampling off.
Note: It is better to slightly overestimate rather than
underestimate the genome length to ensure good
coverage across the genome.
config_coverage 0 The input Data Set can be downsampled to a desired
coverage, provided that both the Downsampled
coverage and Genome Length parameters are
specified and above 0.
Downsampling applies to the entire assembly process,
including polishing. This feature selects reads randomly,
using a fixed random seed for reproducibility.
config_polish_run 1 Enables or disables the polishing stage of the workflow.
Polishing can be disabled to perform fast draft
assemblies. 0 disables this feature; 1 enables it.
config_phase_run 1 Enables or disables the phasing stage of the workflow.
Phasing can be disabled to assemble haploid genomes,
or to perform fast draft assemblies. 0 disables this
feature; 1 enables it.
config_purge_dups_run 1 Enables or disables the purge_dups stage of the
workflow. 0 disables this feature; 1 enables it.
config_autocomp_max_cov 1 If enabled, the maximum allowed overlap coverage at
either the 5’ or the 3’ end of every read is automatically
determined based on the statistics computed from the
overlap piles. This value is appended to the
config_ovl_filter_opt value internally, and
supersedes the manually specified --max-cov and
--max-diff values of that parameter. These options
are used to determine potential repeats and filter out
those reads before the string graph is constructed.
0 disables this feature; 1 enables it.
config_block_size 4096 The overlapping process is performed on pairs of blocks
of input sequences, where each block contains sequences
that add up to this size (in Mbp). Note: The number of
pairwise comparisons grows quadratically with the number
of blocks (meaning more cluster jobs), but the larger the
block size, the more resources are required to execute
each pairwise comparison.
config_existing_db_prefix NONE Allows injection of an existing SeqDB, so that one
doesn’t have to be built from scratch. The provided
existing DB is symbolically linked and used for assembly.
(This option is intended for debugging purposes.)
Page 46
config_ovl_filter_opt --max-diff 80
--max-cov 100
--min-cov 2
--bestn 10
--min-len 4000
--gapFilt
--minDepth 4
--idt-stage2 98
Overlap filter options.
--gapFilt - Enables the chimera filter, which analyzes
each overlap pile, and determines whether a pread is
chimeric based on the local coverage across the pread.
--minDepth - Option for the chimera filter. The chimera
filter is ignored when a local region of a read has
coverage lower than this value.
The other parameters are:
--min-cov - Minimum allowed coverage at either the 5'
or the 3' end of a read. If the coverage is below this
value, the read is blacklisted and all of the overlaps it is
incident with are ignored. This helps remove potentially
chimeric reads.
--max-cov - Maximum allowed coverage at either the 5'
or the 3' end of a read. If the coverage is above this
value, the read is blacklisted and all of the overlaps it is
incident with are ignored. This helps remove repetitive
reads which can make tangles in the string graph. Note
that this value is a heuristic which works well for ~30x
seed length cutoff. If the cutoff is set higher, we advise
that this value be also increased. Alternatively, using the
autocompute_max_cov option can automatically
estimate the value of this parameter, which can improve
contiguity (for example, in cases when the input genome
size or the seed coverage were overestimated).
--max-diff - Maximum allowed difference between the
coverages at the 5' and 3' ends of any particular read. If
the coverage is above this value, the read is blacklisted
and all of the overlaps it is incident with are ignored. If the
autocompute_max_cov option is used, then the same
computed value is supplied to this parameter as well.
--bestn - Keep at most this many overlaps on the 5'
and the 3' side of any particular read.
--min-len - Filter overlaps where either the A-read or the
B-read is shorter than this value.
--idt-stage2 - Filter overlaps with identity below 98%.
--high-copy-sample-rate - Controls the
downsampling of reads from high copy elements to the
expected coverage determined by maxCov*rate, where
rate is the value of this parameter. If rate is 0, then
these high coverage reads are discarded.
config_ovl_min_idt 98 The final overlap identity threshold. Applied during the
final filtering stage, right before the overlaps are passed
to the layout stage.
config_ovl_min_len 1000 The minimum length of either A-read or a B-read to keep
the overlap. Applied during the final filtering stage, right
before the overlaps are passed to the layout stage.
config_ovl_opt --one-hit-per-target
--min-idt 96
Overlapping options for the pancake overlapping tool.
The options set by this parameter here are passed
directly to pancake. For details on pancake options,
use pancake -h.
The defaults used here are: --one-hit-per-target
which keeps only the best hit in case there are multiple
possible overlaps between a pair of reads (tandem
repeats); and --min-idt 96 which will filter out any
overlap with identity lower than 96%.
config_phasing_opt
NONE Options for the phasing tool nighthawk. The options
set by this parameter are passed directly to nighthawk.
For details on nighthawk options, use nighthawk -h.
Page 47
config_phasing_split_opt --split-type
noverlaps
--limit 3000000
Options that control the chunking of the phasing jobs,
and through that regulate the time and memory
consumption of each individual chunk.
The defaults are: --split-type noverlaps, which splits
the chunks by the number of overlaps; and
--limit 3000000, which allows at most approximately
3 million overlaps per chunk.
Empirically, the current defaults keep the maximum
memory consumption (RSS) of the phasing jobs under 4
GB per chunk.
config_seeddb_opt -k 28
-w 120
--space 1
Options to control the seed computation. These options
are passed directly to the pancake seeddb command.
Defaults: -k 28 is the k-mer size of 28 bp; -w 120 is the
minimizer window size of 120 bp; and --space 1
specifies the spacing for spaced seed construction, with
1 gap in between every two bases of the seed.
For more details on these and other options, use
pancake seeddb -h.
config_seqdb_opt --compression 1 Options to control the construction of the sequence
database. These options are passed directly to the
pancake seqdb command.
Current default is --compression 1 which turns on the
2-bit encoding compression of the sequences.
For more details on these and other options, use
pancake seqdb -h.
config_use_hpc 0 This parameter enables (1) or disables (0) an
experimental Homopolymer Compression feature.
If this feature is enabled, the overlaps are computed
from homopolymer-compressed sequences. The layout
stage is somewhat slower because the sequences have
to be aligned to determine the correct homopolymer-
expanded coordinates.
config_use_seq_ids 1 This feature is mostly useful for debugging purposes. If
0 is specified, then the overlaps contain original
sequence names instead of their numerical IDs.
The default of 1 uses the numerical IDs to represent
reads, which uses memory much more efficiently.
config_purge_map_opt --min-map-len
1000
--min-idt 98.0
--bestn 5
This option is used to control the mapping of the reads to
contigs for the purge_dups tool. The mapper used is
pancake, and the options set by this parameter are used
directly by pancake. For details on pancake options,
use pancake -h.
Option --min-map-len 1000 removes any alignment
which did not span more than 1000 bp during the
mapping process; --min-idt 98.0 removes any
alignment with identity below 98.0%, and --bestn 5
keeps at most 5 top scoring alignments for each query
read (one primary alignment and at most 4 secondary
alignments).
Page 48
config_purge_dups_calcuts NONE The third-party tool purge_dups can accept user-
defined cutoffs for purging. On some genomes, the
automated computation of the cutoffs in purge_dups
can result in suboptimal values, and in this case a user
can specify them manually.
This option is passed directly to the
purge_dups_calcuts tool. For details on the possible
values that can be passed to this tool, use
ipa_purge_dups_calcuts without parameters.
Relevant parameters include:
-l INT Lower bound for read depth.
-m INT Transition between haploid and diploid.
-u INT Upper bound for read depth.
config_m4filt_high_copy_sample_rate
1.0 This option is passed to the --high-copy-sample-
rate parameter of the overlap filter, which controls the
downsampling of reads from high copy elements to the
expected coverage determined by maxCov*rate, where
rate is the value of this parameter. If 0, then these high
coverage reads are discarded.
Note: This parameter supersedes the
config_ovl_filter_opt options.
config_max_polish_block_mb
100 During the polishing stage, contigs are grouped into
chunks of approximate size specified by this parameter
(in megabases). Each chunk is processed separately and
in parallel (depending on the system configuration).
config_layout_opt NONE This value is passed directly to the assembly layout
stage. To get a list of valid options for this parameter,
enter ipa2_ovlp_to_graph -h on the command line.
Page 49
HiFiViral SARS-CoV-2 Analysis
Use this application to analyze multiplexed samples sequenced with the
HiFiViral SARS-CoV-2 Kit. For each sample, this analysis provides:
Consensus sequence (FASTA).
Variant calls (VCF).
HiFi reads aligned to the reference (BAM).
Plot of HiFi read coverage depth across the SARS-CoV-2 genome.
Across all samples, this analysis provides:
Job summary table including passing sample count at 90 and 95%
genome coverage.
Sample summary table including, for each sample: count of variable sites, genome coverage, read coverage, probability of multiple strains, and other metrics.
Plate QC graphical summary of performance across samples in assay
plate layout.
Plot of HiFi read depth of coverage for all samples.
Notes:
The application accepts HiFi reads (BAM format) as input. HiFi reads
are reads generated with CCS analysis that have a quality value equal
to or greater than Phred-scaled Q20.
This application is for SARS-CoV-2 analysis only and is not
recommended for other viral studies. The Wuhan reference genome is
included with SMRT Link and used by default, but advanced users may
specify other reference genomes. We have not tested the application
with reference genomes other than the Wuhan reference genome.
The application is intended to identify variable sites and call a single
consensus sequence per sample. The output consensus sequence is
produced based on the dominant variant observed. Minor variant
information that passes through a default threshold may be encoded
in the raw VCF, but does not get propagated into the consensus
sequence FASTA.
The HiFiViral SARS-CoV-2 Analysis application can be run using the
Auto Analysis feature available in Run Design. This feature allows
users to complete all necessary analysis steps immediately after
sequencing without manual intervention. The Auto Analysis workflow
includes CCS, Demultiplex Barcodes, and HiFiViral SARS-CoV-2
Analysis.
HiFiViral SARS-CoV-2 application workflow
1. Process the reads using the mimux tool to trim the probe arm sequences.
2. Align the reads to the reference genome using pbmm2.
3. Call and filter variants using bcftools, generating the raw variant calls in VCF file format. Filtering in this step removes low-quality calls (less than Q20) and normalizes indels.
4. Filter low-frequency variants using vcfcons and generate a consensus sequence by injecting variants into the reference genome. At each position, a variant is called only if both the base coverage exceeds the minimum base coverage threshold (Default = 4) and the fraction of reads that support this variant is above the minimum variant frequency threshold (Default = 0.5).
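As a worked example with the default thresholds: at a site covered by 10 reads, a variant supported by 6 reads (frequency 0.6 > 0.5) is injected into the consensus, whereas a variant supported by 4 reads (frequency 0.4) is not; at a site covered by only 3 reads, coverage falls below the minimum base coverage threshold and no variant is called.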
Preparing input data for the HiFiViral SARS-CoV-2 Analysis
application
1. Run the Demultiplex Barcodes cromwell workflow, where the input to that application is HiFi reads and the primers are multiplexed barcode primers. See “Demultiplex Barcodes” on page 23 for details. If HiFi reads have not been generated on the instrument, run CCS analysis first. See “ccs” on page 7 for details.
•Provide the proper barcode sequences: Barcoded M13 Primer Plate.
•Use the task option lima_symmetric_barcodes=false. (The barcode pairs are asymmetric.)
•Provide the correctly-formatted barcode pair-to-Bio Sample CSV file. For details, see the --task-option sample_wells_csv option in the table further down this page.
Input files
movie.consensusreadset.xml: Previously-demultiplexed HiFi reads,
packaged as separate BAM files wrapped in an XML Data Set. (See
“Preparing input data for the HiFiViral SARS-CoV-2 Analysis
application” on page 50 for details.)
sars_cov2.referenceset.xml: The Wuhan reference genome.
HiFiViral_SARS-CoV-2_Enrichment_Probes.barcodeset.xml:
Dataset XML specifying the probe sequence file in FASTA format.
•[Optional] Plate QC CSV: Four-column CSV file that includes barcode, biosample, plateID, and wellID. To specify this optional file, add the following option to pbservice:
--task-option sample_wells_csv=<path to CSV file>
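An illustrative file might look like the following; the column order follows the description above, and the exact header names are an assumption (check the template for your SMRT Link version):
Barcode,BioSample,PlateID,WellID
bc1001--bc1097,Sample01,Plate1,A01
bc1002--bc1098,Sample02,Plate1,B01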
Output files
pb_sars_cov2_kit.probe_counts_zip: Zipped TSV files with probe
counts, per sample.
pb_sars_cov2_kit.variants_csv: CSV variant calls for all samples.
pb_sars_cov2_kit.vcf_zip: Zipped VCF files containing the final
variant calls, per sample.
pb_sars_cov2_kit.raw_vcf_zip: Zipped VCF files containing the
raw variant calls, per sample.
pb_sars_cov2_kit.fasta_zip: Zipped final consensus sequences,
by sample, in FASTA format. This is a single consensus sequence with
Ns for each sample.
pb_sars_cov2_kit.frag_fasta_zip: Zipped file of consensus
sequences, split on
Ns, in FASTA format, by sample.
Page 51
pb_sars_cov2_kit.mapped_zip: Zipped BAM files containing the
output from mapping HiFi reads to the reference genome, by sample.
samples.consensus_mapped.bam.zip: Zipped BAM files containing
the output from mapping consensus FASTA files to the reference
genome, by sample.
samples.coverage.png.zip: Zipped per-sample coverage graphs in
png format.
hifi_reads.fastq.zip: Zipped file of per-sample trimmed HiFi reads
in FASTQ format.
sample_summary.csv: Sample Summary file in CSV format, with one
row per sample.
Running the SARS-CoV-2 Analysis application
pbcromwell run pb_sars_cov2_kit \
  -e <movie.consensusreadset.xml> \
  -e $SMRT_ROOT/current/bundles/smrtinub/current/private/pacbio/barcodes/HiFiViral_SARS-CoV-2_Enrichment_Probes.barcodeset.xml \
  -e eid_ref_dataset_2:$SMRT_ROOT/current/bundles/smrtinub/current/private/pacbio/canneddata/referenceset/SARS-CoV-2/sars_cov2.referenceset.xml \
  --task-option min_alt_freq=0.5 \
  --task-option min_bq=80 \
  --task-option mimux_overrides="--max-len=800 --same" \
  --task-option sample_wells_csv=None \
  --config cromwell.conf \
  --nproc 8
Options Description
--task-option
sample_wells_csv
Specifies a 4-column CSV file used to generate the Plate QC Report, which
displays analysis results for each sample in the assay plate. The CSV file
must contain barcode pairs, Bio Sample name, Plate IDs, and Well IDs. The
report is useful for diagnosing sample issues based on plate location.
(Default = None)
--task-option min_coverage Specifies the minimum read coverage. Below this value, the consensus
sequence will be set to Ns and no variants are called. (Default = 4)
--task-option min_alt_freq Specifies that only variants whose frequency is greater than this value are
reported. This frequency is determined based on the read depth (DP) and
allele read count (AD) information in the VCF output file. We recommend
using the default value to properly call the dominant alternative variant
while also filtering out potential artifacts. (Default = 0.5)
--task-option min_bq Specifies that reads with barcode scores below this minimum value are
not included in analysis. (Default = 80)
--task-option mimux_overrides Specifies additional options to pass to the mimux preprocessing tool for
trimming and filtering reads by probe sequences. Options should be
entered in space-separated format.
Available options include:
--max-len: Specifies the maximum sequence length. (Default = 800)
--same: Specifies that only reads with arm sequences from the same probe are used.
--task-option probes_fasta Specifies a FASTA file containing probe sequences, if using probes other than those supplied with the HiFiViral SARS-CoV-2 kit.
Page 52
ipdSummary
The ipdSummary tool detects DNA base-modifications from kinetic
signatures. It is part of the
kineticsTool package.
kineticsTool loads IPDs observed at each position in the genome,
compares those IPDs to the values expected for unmodified DNA, and outputs
the result of this statistical test. The expected IPD value for unmodified
DNA can come from either an in-silico control or an amplified control. The
in-silico control is trained by PacBio and shipped with the package. It
predicts the IPD using the local sequence context around the current
position. An amplified control Data Set is generated by sequencing
unmodified DNA with the same sequence as the test sample. An amplified
control sample is usually generated by whole-genome amplification of the
original sample.
Modification detection
The basic mode of kineticsTool does an independent comparison of IPDs
at each position on the genome, for each strand, and outputs various
statistics to CSV and GFF files (after applying a significance filter).
Modifications identification
kineticsTool also has a Modification Identification mode that can decode
multi-site IPD “fingerprints” into a reduced set of calls of specific
modifications. This feature has the following benefits:
Different modifications occurring on the same base can be
distinguished; for example, 6mA and 4mC.
The signal from one modification is combined into one statistic,
improving sensitivity, removing extra peaks, and correctly centering
the call.
Algorithm: Synthetic control
Studies of the relationship between IPD and sequence context reveal that
most of the variation in mean IPD across a genome can be predicted from
a 12-base sequence context surrounding the active site of the DNA
polymerase. The bounds of the relevant context window correspond to the
window of DNA in contact with the polymerase, as seen in DNA/
polymerase crystal structures. To simplify the process of finding DNA
modifications with PacBio data, the tool includes a pre-trained lookup
table mapping 12-mer DNA sequences to mean IPDs observed in C2
chemistry.
Algorithm: Filtering and trimming
kineticsTool uses the Mapping QV generated by pbmm2 and stored in the cmp.h5 or BAM file (or AlignmentSet) to ignore reads that are not confidently mapped. The default minimum Mapping QV required is 10, implying that pbmm2 has 90% confidence that the read is correctly mapped. Because of the range of read lengths inherent in PacBio data, this can be changed using the --mapQvThreshold option.
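For example, a sketch requiring Mapping QV ≥ 20 (99% mapping confidence); file names are placeholders, and the --identify flag is assumed to select the modification types as in the usage examples below:
ipdSummary aligned.bam --reference ref.fasta --identify m6A,m4C --mapQvThreshold 20 --gff basemods.gff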
Page 53
There are a few features of PacBio data that require special attention to achieve good modification detection performance. kineticsTool inspects the alignment between the observed bases and the reference sequence for an IPD measurement to be included in the analysis. The PacBio read sequence must match the reference sequence for k bases around the cognate base; in the current module, k = 1. The IPD distribution at some locus can be thought of as a mixture between the “normal” incorporation process IPD, which is sensitive to the local sequence context and DNA modifications, and a contaminating “pause” process IPD, which has a much longer duration (mean > 10 times longer than normal) but happens rarely (~1% of IPDs).
Note: Our current understanding is that pauses do not carry useful information about the methylation state of the DNA; however, a more careful analysis may be warranted. Roughly 1% of observed IPDs are generated by pause events. Capping observed IPDs at the global 99th percentile is motivated by theory from robust hypothesis testing. Some sequence contexts may have naturally longer IPDs; to avoid capping too much data at those contexts, the cap threshold is adjusted per context as follows:
capThreshold = max(global99, 5*modelPrediction, percentile(ipdObservations, 75))
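As an illustrative calculation: if the global 99th percentile of normalized IPDs is 4.0, the model predicts 1.2 for the context (so 5 × 1.2 = 6.0), and the 75th percentile of IPDs observed at the position is 3.0, the cap is max(4.0, 6.0, 3.0) = 6.0. The numbers are made up for illustration.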
Algorithm: Statistical testing
We test the hypothesis that IPDs observed at a particular locus in the
sample have longer means than IPDs observed at the same locus in
unmodified DNA. If we have generated a Whole Genome Amplified Data
Set, which removes DNA modifications, we use a case-control, two-sample t-test. This tool also provides a pre-calibrated “synthetic control”
model which predicts the unmodified IPD, given a 12-base sequence
context. In the synthetic control case we use a one-sample t-test, with an
adjustment to account for error in the synthetic control model.
Usage
To run using a BAM input, and output GFF and HDF5 files:
ipdSummary aligned.bam --reference ref.fasta --identify m6A,m4C --gff basemods.gff \
  --csv_h5 kinetics.h5
To run using cmp.h5 input, perform methyl fraction calculation, and output GFF and CSV files:
ipdSummary aligned.cmp.h5 --reference ref.fasta --identify m6A,m4C --methylFraction \
  --gff basemods.gff --csv kinetics.csv
Page 54
Input files
A standard PacBio alignment file - either AlignmentSet XML, BAM, or
cmp.h5 - containing alignments and IPD information.
Reference sequence used to perform alignments. This can be either a
FASTA file or a ReferenceSet XML.
Output files
The tool provides results in a variety of formats suitable for in-depth
statistical analysis, quick reference, and consumption by visualization
tools. Results are generally indexed by reference position and reference
strand. In all cases the strand value refers to the strand carrying the
modification in the DNA sample. Remember that the kinetic effect of the
modification is observed in read sequences aligning to the opposite
strand. So reads aligning to the positive strand carry information about
modification on the negative strand and vice versa, but the strand
containing the putative modification is always reported.
modifications.gff: Compliant with the GFF Version 3 specification.
Each template position/strand pair whose probability value exceeds
the probability value threshold appears as a row. The template
position is 1-based, per the GFF specifications. The strand column
refers to the strand carrying the detected modification, which is the
opposite strand from those used to detect the modification. The GFF
confidence column is a Phred-transformed probability value of
detection.
The auxiliary data column of the GFF file contains other statistics
which may be useful for downstream analysis or filtering. These
include the coverage level of the reads used to make the call, and +/-
20 bp sequence context surrounding the site.
modifications.csv: Contains one row for each (reference position, strand) pair that appeared in the Data Set with coverage of at least x. x defaults to 3, but is configurable with the --minCoverage option. The reference position index is 1-based for compatibility with the GFF file in the R environment. Note that this output type scales poorly and is not recommended for large genomes; the HDF5 output should perform much better in these cases.
Output options Description
--gff FILENAME GFF format.
--csv FILENAME Comma-separated value format.
--bigwig FILENAME BigWig file format.
Page 55
Output columns: In-silico control mode
Column Description
refId Reference sequence ID of this observation.
tpl 1-based template position.
strand Native sample strand where kinetics were generated. 0 is the strand of the original
FASTA, 1 is opposite strand from FASTA.
base The cognate base at this position in the reference.
score Phred-transformed probability value that a kinetic deviation exists at this position.
tMean Capped mean of normalized IPDs observed at this position.
tErr Capped standard error of normalized IPDs observed at this position
(standard deviation/sqrt(coverage)).
modelPrediction Normalized mean IPD predicted by the synthetic control model for this sequence
context.
ipdRatio tMean/modelPrediction.
coverage Count of valid IPDs at this position.
frac Estimate of the fraction of molecules that carry the modification.
fracLow 2.5% confidence bound of the frac estimate.
fracUpp 97.5% confidence bound of the frac estimate.
Output columns: Case control mode
Column Description
refId Reference sequence ID of this observation.
tpl 1-based template position.
strand Native sample strand where kinetics were generated. 0 is the strand of the
original FASTA, 1 is opposite strand from FASTA.
base The cognate base at this position in the reference.
score Phred-transformed probability value that a kinetic deviation exists at this
position.
caseMean Mean of normalized case IPDs observed at this position.
controlMean Mean of normalized control IPDs observed at this position.
caseStd Standard deviation of case IPDs observed at this position.
controlStd Standard deviation of control IPDs observed at this position.
ipdRatio caseMean/controlMean.
testStatistic T-test statistic.
coverage Mean of case and control coverage.
controlCoverage Count of valid control IPDs at this position.
caseCoverage Count of valid case IPDs at this position.
Page 56
isoseq3
The isoseq3 tool enables analysis and functional characterization of transcript isoforms for sequencing data generated on PacBio instruments. The analysis can be performed de novo, without a reference genome. If a reference genome is available, an optional collapse step can produce unique isoform files based on genomic coordinates. The input to isoseq3 tools should be HiFi (CCS) reads in BAM format.
Usage
isoseq3 <tool>
Typical workflow
1. Visualize primers, then remove primers and demultiplex:
cat primers.fasta
>5p
GCAATGAAGTCGCAGGGTTGGGG
>3p
GTACTCTGCGTTGATACCACTGCTT
lima movie.ccs.bam primers.fasta demux.bam --isoseq
See “Demultiplex Barcodes” on page 23 for details on the lima tool.
2. Identify and remove polyA tails; also remove artificial concatemers. The output is full-length, non-concatemer (FLNC) reads:
isoseq3 refine demux.5p--3p.bam primers.fasta flnc.bam --require-polya
3. Cluster FLNC reads at the isoform level to generate consensus transcript isoform sequences. This generates unpolished.hq.bam and unpolished.hq.fasta.gz files, which are the high-quality (HQ) transcripts that should be analyzed further.
isoseq3 cluster flnc.bam unpolished.bam --use-qvs
4. (Optional) Map transcripts to the genome and collapse HQ transcripts based on genomic mapping:
pbmm2 align unpolished.bam reference.fasta aligned.sorted.bam --preset ISOSEQ --sort
isoseq3 collapse aligned.sorted.bam out.gff
or
isoseq3 collapse aligned.sorted.bam movie.ccs.bam out.gff
See “pbmm2” on page 84 for details.
Options Description
-h, --help Displays help information and exits.
--version Displays program version number and exits.
Page 57
refine Tool: Removes polyA tails and concatemers from full-length (FL) reads and generates full-length non-concatemer (FLNC) transcripts (FL to FLNC).
Usage
isoseq3 refine [options] <ccs.demux.bam|xml> <primer.fasta|xml> <flnc.bam|xml>
Inputs/outputs Description
ccs.demux.bam|xml Input demultiplexed CCS reads BAM or ConsensusReadSet XML file.
This is usually the output from running lima with the --isoseq option,
such as demux.5p--3p.bam.
primer.fasta|xml Input primer FASTA or BarcodeSet XML file.
flnc.bam|xml Output FLNC BAM or ConsensusReadSet XML file.
Preprocessing Description
--min-polya-length Specifies the minimum poly(A) tail length. (Default = 20)
--require-polya Requires reads to have a poly(A) tail and remove it.
Options Description
--help, -h Displays help information and exits.
--version Displays program version number and exits.
--verbose, -v Sets the verbosity level.
-j,--num-threads Specifies the number of threads to use when processing; 0 means
autodetection. (Default = 0)
--log-file Writes the log to a file. (Default = stderr)
--log-level Specifies the log level; values are [DEBUG, INFO, WARN, TRACE, FATAL].
(Default = WARN)
cluster Tool: Clusters FLNC reads and generates transcripts.
Usage
isoseq3 cluster [options] input output
Example
isoseq3 cluster flnc.bam unpolished.bam --use-qvs
Inputs/outputs Description
input flnc.bam file or movie.consensusreadset.xml file.
output unpolished.bam file prefix (the actual output files will be unpolished.hq.bam and unpolished.lq.bam) or unpolished.transcriptset.xml file.
Options Description
--s1 Specifies the number of seeds for minimer-only clustering.
(Default = 1000)
--s2 Specifies the number of seeds for DP clustering. (Default = 1000)
--poa-cov Specifies the maximum number of CCS reads used for POA consensus.
(Default = 10)
--use-qvs Use CCS analysis Quality Values; sets --poa-cov to 100.
--split-bam Splits BAM output files into a maximum of N files; 0 means no splitting.
(Default = 0)
--min-subreads-split Subread threshold for High-Quality/Low-Quality split; only works with
--use-qvs. (Default = 7)
--log-level Specifies the log level; values are [DEBUG, INFO, WARN, ERROR,
CRITICAL]. (Default = WARN)
-v,--verbose Uses verbose output.
-j,--num-threads Specifies the number of threads to use; 0 means autodetection.
(Default = 0)
--log-file Writes the log to a file. (Default = stdout)
Custom BAM tags
isoseq3 cluster adds the following custom PacBio tags to the output BAM file:
ib: Barcode summary: triplets delimited by semicolons; each triplet contains two barcode indices and the ZMW count, delimited by commas. Example: 0,1,20;0,3,5
im: ZMW names associated with this isoform.
is: Number of ZMWs associated with this isoform.
summarize Tool: Creates a .csv-format barcode overview from transcripts.
Usage
isoseq3 summarize [options] input output
Example
isoseq3 summarize unpolished.bam summary.csv
Inputs/outputs Description
input unpolished.bam file. (The output from isoseq3 cluster.)
output summary.csv file.
Options Description
--log-level Specifies the log level; values are [DEBUG, INFO, WARN, ERROR, CRITICAL]. (Default = WARN)
-v,--verbose Uses verbose output.
--log-file Writes the log to file. (Default = stdout)
Page 59
collapse Tool: Collapses transcripts based on genomic mapping.
Usage
isoseq3 collapse [options] <alignments.bam|xml> [ccs.bam|xml] <out.fastq>
Examples:
isoseq3 collapse aligned.sorted.bam out.gff
isoseq3 collapse aligned.sorted.bam ccs.bam out.gff
Inputs/outputs Description
alignments Alignments mapping transcripts to the reference genome. (BAM or XML
file).
ccs.bam Optional input BAM file containing CCS reads.
out.fastq Collapsed transcripts in FASTQ format.
Options Description
--min-aln-coverage Ignores alignments with less than the Minimum Query Coverage.
(Default = 0.95)
--min-aln-identity Ignores alignments with less than the Minimum Alignment Identity.
(Default = 0.50)
--max-fuzzy-junction Ignores mismatches or indels shorter than or equal to N. (Default = 5)
--version Displays program version number and exits.
--log-file Writes the log to file. (Default = stderr)
--log-level Specifies the log level; values are [DEBUG, INFO, WARN, ERROR,
CRITICAL]. (Default = WARN)
-j,--num-threads Specifies the number of threads to use; 0 means autodetection.
(Default = 0)
Page 60
juliet
juliet is a general-purpose minor variant caller that identifies and phases minor single-nucleotide substitution variants in complex populations. It identifies codon-wise variants in coding regions, performs reference-guided de novo variant discovery, and annotates known drug-resistance mutations. Insertion and deletion variants are currently ignored; support will be added in a future version. There is no technical limitation with respect to the target organism or gene.
The underlying model is a statistical test, the Bonferroni-corrected Fisher's
Exact test. It compares the number of observed mutated codons to the
number of expected mutations at a given position.
juliet uses JSON target configuration files to define different genes in
longer reference sequences, such as overlapping open reading frames in
HIV. These predefined configurations ease batch applications and allow
immediate reproducibility. A target configuration may contain multiple
coding regions within one reference sequence and optional drug
resistance mutation positions.
Notes:
•The preinstalled target configurations are meant for a quick start. It is the user's responsibility to ensure that the target configurations used are correct and up-to-date.
•If the target configuration none is specified, the provided reference is assumed to be in-frame.
Performance
At a coverage of 6,000 CCS reads with a predicted accuracy (RQ) of 0.99, the false positive and false negative rates are below 1% and 0.001% (10^-5), respectively.
Usage
juliet --config "HIV" data.align.bam patientZero.html
Required Description
input_file.bam Input aligned BAM file containing CCS reads, which must be PacBio-
compliant, that is, cigar M is forbidden.
output_file.html Output report HTML file.
Configuration Description
--config,-c Path to the target configuration JSON file, predefined target configuration
tag, or the JSON string.
--mode-phasing,-p Phase variants and cluster haplotypes.
Page 61
Input files
•BAM-format files containing CCS reads. These must be PacBio-compliant, that is, cigar M is forbidden.
•Input CCS reads should have a minimal predicted accuracy of 0.99.
•Reads should be created with CCS analysis using the --richQVs option. Without the --richQVs information, the number of false positive calls might be higher, as juliet is missing information to filter actual heteroduplexes in the sample provided.
•juliet currently does not demultiplex barcoded data; you must provide one BAM file per barcode.
Output files
A JSON and/or HTML file:
juliet data.align.bam patientZero.html
juliet data.align.bam patientZero.json
juliet data.align.bam patientZero.html patientZero.json
The HTML file includes the same content as the JSON file, but in a more human-readable format. The HTML file contains four sections, described below.
Restrictions Description
--region,-r Specifies the genomic region of interest; reads are clipped to that region.
Empty means all reads.
--drm-only,-k Only reports DRM positions specified in the target configuration. Can be
used to filter for drug-resistance mutations - only known variants from the
target configuration are called.
--min-perc,-m Specifies the minimum variant percentage to report.
Example: --min-perc 1 will only show variant calls with an observed
abundance of more than 1%. (Default = 0)
--max-perc,-n Specifies the maximum variant percentage to report.
Example: --max-perc 95 will only show variant calls with an observed
abundance of less than 95%. (Default = 100)
Chemistry override (specify both) Description
--sub,-s Specifies the substitution rate. Use to override the learned rate.
(Default = 0)
--del,-d Specifies the deletion rate. Use to override the learned rate. (Default = 0)
Options Description
--help, -h Displays help information and exits.
--verbose, -v Sets the verbosity level.
--version Displays program version number and exits.
--debug Returns all amino acids, irrespective of their relevance.
--mode-phasing,-p Phases variants and clusters haplotypes.
1. Input data
Summarizes the data provided, the exact call for juliet, and the juliet version, for traceability purposes.
2. Target config
Summarizes details of the provided target configuration for traceability.
This includes the configuration version, reference name and length, and
annotated genes. Each gene name (in bold) is followed by the reference
start, end positions, and possibly known drug resistance mutations.
3. Variant discovery
For each gene/open reading frame, there is one overview table.
Each row represents a variant position.
Each variant position consists of the reference codon, reference amino
acid, relative amino acid position in the gene, mutated codon,
percentage, mutated amino acid, coverage, and possible affected
drugs.
Clicking a row displays the multiple-sequence alignment counts for the -3 to +3 context positions.
4. Drug summaries
Summarizes the variants, grouped by annotated drug mutations.
Predefined target configuration
juliet ships with one predefined target configuration, for HIV. Following
is the command syntax for running that predefined target configuration:
juliet --config "HIV" data.align.bam patientZero.html
Note: For the predefined configuration HIV, use the HIV HXB2
complete genome for alignment.
Customized target configuration
To define your own target configuration, create a JSON file. The root child genes contains a list of coding regions, each with begin and end positions, the name of the gene, and a list of drug-resistance mutations (DRMs). Each DRM consists of its name and the positions it targets. The drms field is optional. If referenceSequence is provided, it is used to call mutations; otherwise variants are tested against the major codon. All indices are with respect to the provided alignment space, 1-based, begin-inclusive and end-exclusive [).
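For example, begin = 2550 and end = 2700 span reference positions 2550 through 2699, a 150-bp (50-codon) region.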
Target configuration: Example 1 - A customized JSON target configuration file named my_customized_hiv.json:
{
  "genes": [
    {
      "begin": 2550,
      "drms": [
        {
          "name": "fancy drug",
          "positions": [ "M41L" ]
        }
      ],
      "end": 2700,
      "name": "Reverse Transcriptase"
    }
  ],
  "referenceName": "my seq",
  "referenceSequence": "TGGAAGGGCT...",
  "version": "Free text to version your config files",
  "databaseVersion": "DrugDB version x.y.z (last updated YYYY-MM-DD)"
}
Run with a customized target configuration using the --config option:
juliet --config my_customized_hiv.json data.align.bam patientZero.html
Valid formats for DRMs/positions
103 Only the reference position.
M103 Reference amino acid and reference position.
M103L Reference aa, reference position, mutated aa.
M103LKA Reference aa, reference position, list of possible mutated aas.
103L Reference position and mutated aa.
103LG Reference position and list of possible mutated aas.
Missing amino acids are processed as wildcard (*).
Example:
{ "name": "ATV/r", "positions": [ "V32I", "L33", "46IL", "I54VTALM", "V82ATFS", "84" ] }
Target configuration: Example 2 - BCR-ABL:
For BCR-ABL, using the ABL1 gene with the following reference
NM_005157.5, a typical target configuration looks like this:
{
  "genes": [
    {
      "name": "ABL1",
      "begin": 193,
      "end": 3585,
      "drms": [
        { "name": "imatinib", "positions": [ "T315AI", "Y253H", "E255KV", "V299L", "F317AICLV", "F359CIV" ] },
        { "name": "dasatinib", "positions": [ "T315AI", "V299L", "F317AICLV" ] },
        { "name": "nilotinib", "positions": [ "T315AI", "Y253H", "E255KV", "F359CIV" ] },
        { "name": "bosutinib", "positions": [ "T315AI" ] }
      ]
    }
  ],
  "referenceName": "NM_005157.5",
  "referenceSequence": "TTAACAGGCGCGTCCC..."
}
No target configuration
If no target configuration is specified, either make sure that the sequence
is in-frame, or specify the region of interest to mark the correct reading
frame, so that amino acids are correctly translated. The output is labeled
with
unknown as the gene name:
juliet data.align.bam patientZero.html
Phasing
The default mode is to call amino-acid/codon variants independently. Using the --mode-phasing option, variant calls from distinct haplotypes are clustered and visualized in the HTML output.
• The row-wise variant calls are "transposed" onto per-column haplotypes. Each haplotype has an ID: [A-Z]{1}[a-z]?.
• For each variant, colored boxes in this row mark haplotypes that contain this variant.
• Colored boxes per haplotype/column indicate variants that co-occur.
• Wild type (no variant) is represented by plain dark gray. A color palette helps to distinguish between columns.
Each JSON variant position has an additional haplotype_hit boolean array whose length equals the number of haplotypes; each entry indicates whether that variant is present in the haplotype. A haplotype block under the root of the JSON file contains counts and read names. The order of those haplotypes matches the order of all haplotype_hit arrays.
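For example, to phase variants using the predefined HIV configuration and write both report formats:
juliet --config "HIV" --mode-phasing data.align.bam patientZero.html patientZero.json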
There are two types of tooltips in the haplotype section of the table. The first tooltip is for the Haplotypes % and shows the number of reads that count towards (A) actually reported haplotypes, (B) haplotypes that have fewer than 10 reads and are not reported, and (C) haplotypes that are not suitable for phasing. These three categories are mutually exclusive, and their sum is the total number of reads going into juliet. For (C), the three different marginals provide insight into sample quality; as they are marginals, they are not exclusive and can overlap. The following image shows a sample with bad PCR conditions. The second type of tooltip is for each haplotype percentage and shows the number of reads contributing to that haplotype.
Microbial Genome Analysis
The Microbial Genome Analysis application is powered by the IPA HiFi genome assembler and includes a base modification detection feature performed after the assembly.
Workflow of the Microbial Genome Analysis application
The workflow consists of several steps around two main stages:
1. Chromosomal stage: Assemble large contigs using IPA, the HiFi
genome assembly tool.
2. Separate reads that were used for accurate large contigs from all other
reads.
3. Plasmid stage: Assemble plasmids (using IPA) from the reads sepa-
rated in the previous step.
4. De-duplicate plasmids.
5. Collect all contigs into a single FASTA file.
6. Rotate circular contigs.
7. Align the input Data Set to the assembled contigs.
8. Polish assembled contigs using Racon.
9. Perform base modification detection.
The application accepts HiFi XML Data Sets as input, and has an embedded downsampling feature:
• If the genome size and the desired coverage are specified, both stages of assembly are downsampled, as with the Genome Assembly application.
• Otherwise, the full coverage is used.
The embedded downsampling feature is not applied to the alignment stage; all input reads will be aligned against the assembled contigs.
Usage
The Microbial Genome Analysis application is run using the pbcromwell
run command, with the
pb_microbial_analysis parameter to specify the
application. See “pbcromwell” on page 75 for details.
To view information on the available Microbial Genome Analysis options,
enter:
pbcromwell show-workflow-details pb_microbial_analysis
The minimum command needed to run the workflow requires the input
Data Set.
The following example performs assembly and base modification
detection using an input XML Data Set, and uses all default settings,
including 1 CPU:
pbcromwell run pb_microbial_analysis -e <input.consensusreadset.xml>
Note: To specify different task options on the command line, consider the
following example:
pbcromwell run pb_microbial_analysis \
-e <input.consensusreadset.xml> \
--task-option ipa2_genome_size=0 \
--task-option ipa2_downsampled_coverage=0 \
--task-option microasm_plasmid_contig_len_max=300000 \
--task-option ipa2_cleanup_intermediate_files=True \
--task-option dataset_filters="" \
--task-option filter_min_qv=20 \
--nproc 8
The default options for this application should work well for any genome
type.
As microbes are relatively small, we rarely find much advantage to using
more than 4 threads, or more than 2 concurrent jobs.
Microbial Genome Analysis parameters
Option Default value Description
-e, --eid_ccs NONE This is a SMRT Link-specific input parameter and supports only PacBio ConsensusReadSet XML files as input.
--task-option reads NONE Optional parameter, required if -e <input> is not specified. Supports multiple input formats: FASTA, FASTQ, BAM, XML, FOFN, and gzipped versions of FASTA/FASTQ.
--task-option ipa2_genome_size 10M The approximate number of base pairs expected in the chromosomal genome. Used only for downsampling in the assembly stages (that is, not in polishing). If value <= 0, downsampling is off. Default: 10M (10 megabase pairs). Note: ipa2_genome_size is currently the only option that accepts a metric suffix. Commas and decimals are not accepted anywhere. ipa2_genome_size actually accepts a string, which can be an integer followed by one of these metric suffixes: k/M/G. For example, 4500k means "4,500 kilobases" or "4,500,000". M stands for Mega and G stands for Giga.
--task-option ipa2_downsampled_coverage 100 The maximum coverage after downsampling with respect to the estimated genome_size. (The default genome_size was chosen to be larger than most realistic microbes.) If value <= 0, downsampling is off. Default: 100. Example: If your genome is 1G in length and you specify ipa2_genome_size=2G, you have over-estimated by 2x. If you also specify ipa2_downsampled_coverage=100, your data will be downsampled to 200x coverage, simply because of the over-estimate. The cost of extra coverage is greater runtime. The defaults are usually fine.
--task-option ipa2_advanced_options_chrom (see Description) A semicolon-separated list of KEY=VALUE pairs. New line characters are not accepted. (These are described later in this document.) Default:
config_block_size = 100;
config_seeddb_opt = -k 28 -w 20 --space 0 --use-hpc-seeds-only;
config_ovl_opt = --one-hit-per-target --min-idt 98 --traceback --mask-hp --mask-repeats --trim --trim-window-size 30 --trim-match-frac 0.75
--task-option ipa2_advanced_options_plasmid (see Description) A semicolon-separated list of KEY=VALUE pairs. New line characters are not accepted. (These are described later in this document.) Default:
config_block_size = 100;
config_ovl_filter_opt = --max-diff 80 --max-cov 100 --min-cov 2 --bestn 10 --min-len 500 --gapFilt --minDepth 4 --idt-stage2 98;
config_ovl_min_len = 500;
config_seeddb_opt = -k 28 -w 20 --space 0 --use-hpc-seeds-only;
config_ovl_opt = --one-hit-per-target --min-idt 98 --min-map-len 500 --min-anchor-span 500 --traceback --mask-hp --mask-repeats --trim --trim-window-size 30 --trim-match-frac 0.75 --smart-hit-per-target --secondary-min-ovl-frac 0.05;
config_layout_opt = --allow-circular;
--task-option ipa2_cleanup_intermediate_files TRUE Removes intermediate files from the run directory to save space.
--task-option microasm_plasmid_contig_len_max 300000 After the chromosomal stage, in task filter_draft_contigs, separates long contigs (presumed to be chromosomal) from shorter contigs (to be re-assembled in the plasmid stage). Then, after the plasmid stage, in dedup_plasmids, contigs less than this value are ignored. The default value is usually fine.
--task-option microasm_run_secondary_polish TRUE Specifies that an additional round of polishing be applied after the two stages of assembly have completed.
--task-option run_basemods TRUE Specifies that base modification analysis be performed.
--task-option kineticstools_identify_mods m4C,m6A Specifies the base modifications to identify, as a comma-separated list.
--task-option kineticstools_p_value 0.001 Specifies the probability value cutoff for detecting base modifications.
--task-option motif_min_score 35 Specifies the minimum QMod score used to identify a motif.
--task-option motif_min_fraction 0.30 Specifies the minimum methylated fraction to identify a motif.
--task-option run_find_motifs TRUE Specifies that motif-finding be performed.
--task-option dataset_filters NONE (General pbcromwell option) A semicolon-separated (not comma-separated) list of other filters to add to the Data Set.
--task-option filter_min_qv 20 (General pbcromwell option) Phred-scale integer QV cutoff for filtering HiFi reads. The default for all applications is 20 (QV 20), or 99% predicted accuracy.
Input files
• *.bam file containing PacBio data.
• *.xml file containing PacBio data.
Output files
• final_assembly.fasta: File containing all assembled contigs, rotated.
• assembly.rotated.polished.renamed.fsa: File for NCBI, but otherwise identical to final_assembly.fasta; only the headers are changed.
• pb_microbial_analysis.motifs_csv: File containing motifs and modifications.
Advanced assembly options
Note: Advanced parameters should rarely be modified. Advanced parameters for the Microbial Genome Assembly workflow are the same as those for the Genome Assembly workflow. See "Advanced parameters" on page 44 for details.
Advanced parameters specified on the command line:
• Are in the form of key = value pairs.
• Each pair is separated by a semicolon (;) character.
• The full set of advanced parameters is surrounded by one set of double quotes.
• The specified value of a parameter overwrites the default options for that key. All desired options of that parameter must be explicitly listed, not just the ones which should change from the default.
• Setting an empty value clears the parameter; it does not reset the value back to default.
Example:
--task-option ipa2_advanced_options_chrom="config_seeddb_opt=-k 28;config_block_size=2048"
General pbcromwell options:
--task-option log_level INFO Specifies the logging (verbosity) level for tools that support this feature. Values are: [TRACE, DEBUG, INFO, WARN, FATAL]
--task-option nproc 1 Specifies the number of CPU threads to use.
--task-option max_nchunks 40 Specifies the maximum number of chunks per task.
--task-option tmp_dir /tmp Specifies an optional temporary directory used by several subtools, such as sort.
motifMaker
The motifMaker tool identifies motifs associated with DNA modifications
in prokaryotic genomes. Modified DNA in prokaryotes commonly arises
from restriction-modification systems that methylate a specific base in a
specific sequence motif. The canonical example is the m6A methylation
of adenine in GATC contexts in E. coli. Prokaryotes may have a very large
number of active restriction-modification systems present, leading to a
complicated mixture of sequence motifs.
PacBio SMRT sequencing is sensitive to the presence of methylated DNA
at single base resolution, via shifts in the polymerase kinetics observed in
the real-time sequencing traces. For more background on modification detection, see the Methylome Analysis White Paper.
Algorithm
Existing motif-finding algorithms such as MEME-chip and YMF are suboptimal for this case for the following reasons:
• They search for a single motif, rather than attempting to identify a complicated mixture of motifs.
• They generally don't accept the notion of aligned motifs - the input to the tools is a window into the reference sequence which can contain the motif at any offset, rather than a single center position that is available with kinetic modification detection.
• Implementations generally either use a Markov model of the reference (MEME-chip), or do exact counting on the reference, but place restrictions on the size and complexity of the motifs that can be discovered.
Following is a rough overview of the algorithm used by
motifMaker:
Define a motif as a set of tuples: (position relative to methylation, required base). Positions not listed in the motif are implicitly degenerate. Given a list of modification detections and a genome sequence, define the following objective function on motifs:
score(motif) = (# of detections matching motif) / (# of genome sites matching motif) × (sum of log p-values of detections matching motif)
             = (fraction methylated) × (sum of log p-values of matches)
Then, search (close to exhaustively) through the space of all possible motifs, progressively testing longer motifs using a branch-and-bound search. The "fraction methylated" term must be at most 1, so the maximum achievable score of a child node is the sum of the scores of the modification hits in the current node; this allows pruning of all search paths whose maximum achievable score is less than the best score discovered so far.
Usage
Run the find command, and pass the reference FASTA and the
modifications.gff (.gz) file output by the PacBio modification detection
workflow.
The reprocess subcommand annotates the GFF file with motif
information for better genome browsing.
MotifMaker [options] [command] [command options]
find Command: Run motif-finding.
find [options]
reprocess Command: Update a modifications.gff file with motif
information based on new Modification QV thresholds.
reprocess [options]
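For example, minimal invocations using the options documented below (the file names are illustrative):
MotifMaker find -f reference.fasta -g modifications.gff.gz -o motifs.csv
MotifMaker reprocess -f reference.fasta -g modifications.gff.gz --motifs motifs.csv -o modifications.motifs.gff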
Output files
Using the find command:
• Output CSV file: This file has the same format as described under "Fields included in motif_summary.csv" in the Methylome Analysis White Paper.
Using the reprocess command:
• Output GFF file: The format of the output file is the same as the input file, and is described in the Methylome Analysis White Paper under "Fields included in the modifications.gff file".
find options Description
-h, --help Displays help information and exits.
* -f, --fasta Reference FASTA file.
* -g, --gff Modifications.gff or .gff.gz file.
-m, --minScore Specifies the minimum Qmod score to use in motif finding.
(Default = 40.0)
* -o, --output Outputs motifs.csv file.
-x, --xml Outputs motifs XML file.
reprocess options Description
-c, --csv Raw modifications.csv file.
* -f, --fasta Reference FASTA file.
* -g, --gff Modifications.gff or .gff.gz file.
-m, --minFraction Specifies that only motifs above this methylated fraction are used.
(Default = 0.75)
-m, --motifs Motifs.csv file.
* -o, --output Reprocessed modifications.gff file.
pbcromwell
The pbcromwell tool is PacBio’s wrapper for the cromwell scientific
workflow engine used to power SMRT Link. pbcromwell includes
advanced utilities for interacting directly with a Cromwell server.
pbcromwell is designed primarily for running workflows distributed and
supported by PacBio, but it is written to handle any valid WDL source
(version 1.0), and is very flexible in how it takes input. PacBio workflows
are expected to be found in the directory defined by the
SMRT_PIPELINE_BUNDLE_DIR environment variable, which is
automatically defined by the SMRT Link distribution.
Note that pbcromwell does not interact with SMRT Link services; to run a
Cromwell workflow as a SMRT Link job, use pbservice. (For details, see
“pbservice” on page 91.)
Note: Interaction with the Cromwell server is primarily intended for developers and power users.
Usage:
pbcromwell run [-h] [--output-dir OUTPUT_DIR] [--overwrite] [-i INPUTS]
[-e ENTRY_POINTS] [-n NPROC] [-c MAX_NCHUNKS]
[--target-size TARGET_SIZE] [--queue QUEUE] [-o OPTIONS]
[-t TASK_OPTIONS] [-b BACKEND] [-r MAX_RETRIES]
[--tmp-dir TMP_DIR] [--config CONFIG] [--dry-run]
[run,show-workflows,show-workflow-details,configure,submit,get-job,abort,metadata,show-running,wait]
Options Description
--output-dir OUTPUT_DIR Specifies the output directory for cromwell output.
(Default = cromwell_out)
--overwrite Overwrites the output directory, if it exists. (Default = False)
-i INPUTS, --inputs INPUTS Specifies cromwell inputs and settings as JSON files. (Default = None)
-e ENTRY_POINTS, --entry ENTRY_POINTS Specifies the entry point Data Set; may be repeated for workflows that take more than one input Data Set. Note that all PacBio workflows require at least one such entry point.
-n NPROC, --nproc NPROC Specifies the number of processors per task. (Default = 1)
-c MAX_NCHUNKS, --max-nchunks MAX_NCHUNKS Specifies the maximum number of chunks per task. (Default = None)
--target-size TARGET_SIZE Specifies the target chunk size. (Default = None)
--queue QUEUE Specifies the cluster queue to use. (Default = None)
-o OPTIONS, --options OPTIONS Specifies additional cromwell engine options, as a JSON file.
(Default = None)
-t TASK_OPTIONS, --task-option TASK_OPTIONS Specifies workflow- or task-level options as key=value strings, specific to the application. May be specified multiple times for multiple options. (Default = [])
-b BACKEND, --backend BACKEND Specifies the backend to use for running tasks. (Default = None)
-r MAX_RETRIES, --maxRetries MAX_RETRIES Specifies the maximum number of times to retry a failing task. (Default = 1)
Enter pbcromwell {command} -h for a command's options.
Examples:
Show available PacBio-developed workflows:
pbcromwell show-workflows
Show details for a PacBio workflow:
pbcromwell show-workflow-details pb_ccs
Generate cromwell.conf with HPC settings:
pbcromwell configure --default-backend SGE --output-file cromwell.conf
Launch a PacBio workflow:
pbcromwell run pb_ccs -e /path/to/movie.subreadset.xml --nproc 8 --config /full/path/to/cromwell.conf
pbcromwell run Command: Run a Cromwell workflow. Multiple input modes are supported, including a pbsmrtpipe-like set of arguments (for PacBio workflows only), and JSON files already in the native Cromwell format.
All PacBio workflows have similar requirements to the pbsmrtpipe pipelines in previous SMRT Link versions:
--tmp-dir TMP_DIR Specifies an optional temporary directory for cromwell tasks, which
must exist on all compute hosts. (Default = None)
--config CONFIG Specifies a Java configuration file for running cromwell. (Default = None)
--dry-run Specifies that cromwell is not executed, but instead writes out final inputs and then exits. (Default = False)
workflow Specifies the workflow ID (such as pb_ccs or
cromwell.workflows.pb_ccs for PacBio workflows only) or a path to
a Workflow Description Language (WDL) source file.
Options Description
-h, --help Displays help information and exits.
--version Displays program version number and exits.
--log-file LOG_FILE Writes the log to file. (Default = None, writes to stdout.)
--log-level=INFO Specifies the log level; values are [DEBUG, INFO, WARNING, ERROR,
CRITICAL.] (Default = INFO)
--debug Alias for setting the log level to DEBUG. (Default = False)
--verbose, -v Sets the verbosity level. (Default = None)
1. One or more PacBio dataset XML entry points, usually a SubreadSet or ConsensusReadSet (--entry-point <FILE>).
2. Any number of workflow-specific task options (--task-option <OPTION>).
3. Engine options independent of the workflow, such as the number of processors per task (--nproc) or the compute backend (--backend).
Output is directed to a new directory (--output-dir, which defaults to cromwell_out). This includes final inputs for the Cromwell CLI, and subdirectories for logs (workflow and task outputs), links to output files, and the Cromwell execution itself, which has a complex nested directory structure. Detailed information about the workflow execution can be found in the file metadata.json, in the native Cromwell format.
Note that output file links do not include the individual resource files of Data Sets and reports (BAM files, index files, plot PNGs, and so on). Follow the symbolic links to their real path (for example, using readlink -f) to find report plots.
For further information, see the official Cromwell documentation.
Workflow examples:
Run the CCS analysis:
pbcromwell run pb_ccs -e <SUBREADS> --nproc 8 --config /full/path/to/cromwell.conf
Run the Iso-Seq workflow, including mapping to a reference, and execute
on SGE:
pbcromwell run pb_isoseq3 -e <SUBREADS> -e <PRIMERS> -e <REFERENCE> --nproc 8 --config /full/path/to/cromwell.conf
Run a user-defined workflow:
pbcromwell run my_workflow.wdl -i inputs.json -o options.json --config /full/path/to/cromwell.conf
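For a user-defined workflow, the inputs JSON follows the native Cromwell convention of workflowName.inputName keys; a minimal sketch (the workflow and input names here are hypothetical):
{
  "my_workflow.reads": "/path/to/movie.subreadset.xml",
  "my_workflow.nproc": 8
}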
Set up input files but exit before starting Cromwell:
pbcromwell run pb_ccs -e <SUBREADS> --nproc 8 --dry-run
Print details about the named PacBio workflow, including input files and
task options. Note: The prefix
cromwell.workflows. is optional.
pbcromwell show-workflow-details pb_ccs
pbcromwell show-workflow-details cromwell.workflows.pb_ccs
pbcromwell show-workflow-details Command: Display details about a specified PacBio workflow, including input files and task options.
Usage:
pbcromwell show-workflow-details [-h] [--inputs-json INPUTS_JSON]
workflow_id
pbcromwell configure Command: Generate the Java configuration file used by cromwell to define backends and other important engine options that cannot be set at runtime. You can pass this file to pbcromwell run using the --config argument.
Usage:
pbcromwell configure [-h] [--port PORT]
[--local-job-limit LOCAL_JOB_LIMIT]
[--jms-job-limit JMS_JOB_LIMIT]
[--db-port DB_PORT] [--db-user DB_USER]
[--db-password DB_PASSWORD]
[--default-backend DEFAULT_BACKEND]
[--max-workflows MAX_WORKFLOWS]
[--output-file OUTPUT_FILE] [--timeout TIMEOUT]
[--cache] [--no-cache]
Options Description
workflow_id Specifies the workflow ID, such as pb_ccs or
cromwell.workflows.pb_ccs for PacBio workflows only.
--inputs-json INPUTS_JSON Writes a JSON template containing workflow inputs to the specified file.
(Default = None)
Options Description
--port PORT Specifies the port that cromwell should listen to. (Default = 8000)
--local-job-limit LOCAL_JOB_LIMIT Specifies the maximum number of local jobs/tasks that can be run at once. (Default = 10)
--jms-job-limit JMS_JOB_LIMIT Specifies the maximum number of jobs/tasks that can be submitted to the
queueing system at one time. (Default = 500)
--db-port DB_PORT Specifies the database port for cromwell to use; if undefined, database
configuration is omitted. (Default = None)
--db-user DB_USER Specifies the user name used to connect to Postgres.
(Default = smrtlink_user)
--db-password DB_PASSWORD Specifies the password used to connect to Postgres.
--default-backend DEFAULT_BACKEND Specifies the default job execution backend. Choices are: Local, SGE, Slurm, PBS, or LSF. (Default = Local)
--max-workflows MAX_WORKFLOWS Specifies the maximum number of workflows that cromwell can run at
once. (Default = 100)
--output-file OUTPUT_FILE Specifies the name of the output configuration file.
(Default = cromwell.conf)
--timeout TIMEOUT Specifies the time to wait for task completion before checking the cluster
job status. (Default = 600)
--cache Enables call caching. (Default = True)
--no-cache Disables call caching. (Default = False)
pbindex
The pbindex tool creates an index file that enables random access to
PacBio-specific data in BAM files.
Usage
pbindex <input>
Input file
*.bam file containing PacBio data.
Output file
*.pbi index file, with the same prefix as the input file name.
Options Description
-h, --help Displays help information and exits.
--version Displays program version number and exits.
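Example:
pbindex movie.subreads.bam
This writes movie.subreads.bam.pbi next to the input BAM file.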
pbmarkdup
The pbmarkdup tool marks PCR duplicates in CCS reads Data Sets from amplified libraries. PCR duplicates are different reads that arose from amplifying a single source molecule. pbmarkdup can also optionally remove the duplicate reads.
Note: pbmarkdup only works with CCS reads, not with subreads.
pbmarkdup uses a reference-free comparison method. Duplicates are identified as pairs of reads that:
1. Have the same length, within 10 bp, and
2. Align with high percent identity (>98%) over the first and last 250 bp of the molecule.
Clusters are formed from sets of two or more duplicate reads, and a single read is selected as the representative of each cluster. Other reads in the cluster are considered duplicates.
How are duplicates marked?
In FASTA and FASTQ formats, reads from duplicate clusters have annotated names. The following is a FASTA example:
>m64013_191117_050515/67104/ccs DUPLICATE=m64013_191117_050515/3802014/ccs DS=2
This shows a marked duplicate read m64013_191117_050515/67104/ccs that is a duplicate of m64013_191117_050515/3802014/ccs in a cluster with 2 reads (the DS value). Accordingly, the following is the read selected as the representative of the molecule:
>m64013_191117_050515/3802014/ccs DS=2
In BAM format, duplicate reads are flagged with 0x400. The read-level tag ds provides the number of reads in a cluster (like the DS attribute described above for FASTA/FASTQ), and the du tag provides the name of the representative read (like the DUPLICATE attribute described above for FASTA/FASTQ).
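Because duplicates carry the standard 0x400 BAM flag, they can also be inspected with general-purpose tools; for example, with samtools (the output file name is illustrative):
samtools view -c -f 0x400 output.bam   # count reads flagged as duplicates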
Usage
pbmarkdup [options] <INFILE.bam|xml|fa|fq|fofn> <OUTFILE.bam|xml|fa.gz|fq.gz>
Options Description
-h, --help Displays help information and exits.
--version Displays program version number and exits.
--log-file Logs to a file, instead of stderr.
Input files
CCS reads from one or multiple movies in any of the following formats:
• PacBio BAM (.ccs.bam)
• PacBio dataset (.dataset.xml)
• File of File Names (.fofn)
• FASTA (.fasta, .fasta.gz)
• FASTQ (.fastq, .fastq.gz)
Output files
CCS reads with duplicates marked in a format inferred from the file extension:
• PacBio BAM (.ccs.bam)
• PacBio dataset (.dataset.xml), which also generates a corresponding BAM file.
• FASTA (.fasta.gz)
• FASTQ (.fastq.gz)
--log-level Specifies the log level; values are [TRACE, DEBUG, INFO, WARN, FATAL]. (Default = WARN)
--log-level INFO produces a summary report such as:
LIBRARY    READS   UNIQUE MOLECULES   DUPLICATE READS
-----------------------------------------------------
<Unnamed>  25000   24948 (99.8%)      52 (0.2%)
SS-lib       496     493 (99.4%)       3 (0.6%)
-----------------------------------------------------
TOTAL      25496   25441 (99.8%)      55 (0.2%)
-j,--num-threads Specifies the number of threads to use; 0 means autodetection. (Default = 0)
Duplicate marking options Description
--cross-library, -x Identifies duplicate reads across sequencing libraries. Libraries are
specified in the BAM read group LB tag.
Output options Description
--rmdup, -r Excludes duplicates from OUTFILE. (This is redundant when --dup-file is specified.)
--dup-file FILE Stores duplicate reads in an extra file other than OUTFILE. The format of
this file can be different from the output file.
--clobber, -f Overwrites OUTFILE if it exists.
Allowed input/output combinations
Input File      Output BAM    Output Dataset   Output FASTQ   Output FASTA
Input BAM       Allowed       Allowed          Allowed        Allowed
Input Dataset   Allowed       Allowed          Allowed        Allowed
Input FASTQ     Not Allowed   Not Allowed      Allowed        Allowed
Input FASTA     Not Allowed   Not Allowed      Not Allowed    Allowed
Examples
Run on a single movie:
pbmarkdup movie.ccs.bam output.bam
Run on multiple movies:
pbmarkdup movie1.fasta movie2.fasta output.fasta
Run on multiple movies and output duplicates in a separate file:
pbmarkdup movie1.ccs.bam movie2.fastq uniq.fastq --dup-file dups.fasta
pbmm2
The pbmm2 tool is a minimap2 wrapper for PacBio data: it aligns native PacBio data and outputs PacBio BAM files.
pbmm2 supports native PacBio input and output, provides sets of recommended parameters, generates sorted output on-the-fly, and post-processes alignments. Sorted output can be used directly for polishing using GenomicConsensus, if BAM has been used as input to pbmm2.
pbmm2 adds the following SAM tag to each aligned record:
mg: the gap-compressed alignment identity, defined as nM / (nM + nMM + nInsEvents + nDelEvents).
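For example, an alignment with 980 matching bases, 15 mismatches, 3 insertion events, and 2 deletion events has mg = 980 / (980 + 15 + 3 + 2) = 98%.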
Usage
pbmm2 <tool>
index Command: Indexes references and stores them as .mmi files. Indexing is optional, but recommended if you use the same reference with the same --preset multiple times.
Usage:
pbmm2 index [options] <ref.fa|xml> <out.mmi>
Input file
*.fasta, *.fa file containing reference contigs or
*.referenceset.xml.
Output file
out.mmi (minimap2 index file.)
Notes:
• You can use existing minimap2 .mmi files with pbmm2 align.
• If you use an index file, you cannot override the parameters -k, -w, nor -u in pbmm2 align.
• The minimap2 parameter -H (homopolymer-compressed k-mer) is always on for the SUBREAD and UNROLLED presets, and can be disabled using -u.
Options Description
-h, --help Displays help information and exits.
--version Displays program version number and exits.
align Command: Aligns PacBio reads to reference sequences. The output argument is optional; if not provided, the BAM output is streamed to stdout.
Usage:
pbmm2 align [options] <ref.fa|xml|mmi> <in.bam|xml|fa|fq> [out.aligned.bam|xml]
Input files
• *.fasta or *.fa file containing reference contigs, *.referenceset.xml, or *.mmi index file.
• *.bam, *.subreadset.xml, *.consensusreadset.xml, *.transcriptset.xml, *.fasta, *.fa, *.fastq, or *.fq file containing PacBio data.
Output files
• *.bam aligned reads in BAM format.
• *.alignmentset.xml, *.consensusalignmentset.xml, or *.transcriptalignmentset.xml if XML output was chosen.
The following Data Set Input/output combinations are allowed:
SubreadSet > AlignmentSet
pbmm2 align hg38.referenceset.xml movie.subreadset.xml hg38.movie.alignmentset.xml
ConsensusReadSet > ConsensusAlignmentSet
pbmm2 align hg38.referenceset.xml movie.consensusreadset.xml hg38.movie.consensusalignmentset.xml --preset CCS
Options Description
--preset Specifies the alignment mode:
"SUBREAD" -k 19 -w 10
"CCS" -k 19 -w 10 -u
"ISOSEQ" -k 15 -w 5 -u
"UNROLLED" -k 15 -w 15
The option is not case-sensitive. (Default = SUBREAD)
-k Specifies the k-mer size, which cannot be larger than 28. (Default = -1)
-w Specifies the Minimizer window size. (Default = -1)
-u,--no-kmer-compression Disables homopolymer-compressed k-mer. (Compression is on by default
for the SUBREAD and UNROLLED presets.)
--short-sa-cigar Populates the SA tag with the short cigar representation.
TranscriptSet > TranscriptAlignmentSet
pbmm2 align hg38.referenceset.xml movie.transcriptset.xml hg38.movie.transcriptalignmentset.xml --preset ISOSEQ
FASTA/FASTQ input
In addition to native PacBio BAM input, reads can also be provided in
FASTA and FASTQ formats.
Attention: The resulting output BAM file cannot be used as input into
GenomicConsensus!
With FASTA/FASTQ input, the --rg option sets the read group. Example:
pbmm2 align hg38.fasta movie.Q20.fastq hg38.movie.bam --preset CCS --rg '@RG\tID:myid\tSM:mysample'
All three reference file formats .fasta, .referenceset.xml, and .mmi
can be combined with FASTA/FASTQ input.
Options Description
-h, --help Displays help information and exits.
--chunk-size Processes N records per chunk. (Default = 100)
--sort Generates a sorted BAM file.
-m,--sort-memory Specifies the memory per thread for sorting. (Default = 768M)
-j,--alignment-threads Specifies the number of threads used for alignment. 0 means
autodetection. (Default = 0)
-J,--sort-threads Specifies the number of threads used for sorting. 0 means 25% of -j, with
a maximum of 8. (Default = 0)
--sample Specifies the sample name for all read groups. Defaults, in order of
precedence: A) SM field in the input read group B) Biosample name
C) Well sample name D) "UnnamedSample".
--rg Specifies the read group header line such as '@RG\tID:xyz\tSM:abc'.
Only for FASTA/Q inputs.
-y,--min-gap-comp-id-perc Specifies the minimum gap-compressed sequence identity, in percent.
(Default = 70)
-l,--min-length Specifies the minimum mapped read length, in base pairs. (Default = 50)
-N,--best-n Outputs at maximum N alignments for each read; 0 means no maximum. (Default = 0)
--strip Removes all kinetic and extra QV tags. The output cannot be polished.
--split-by-sample Specifies one output BAM file per sample.
--no-bai Omits BAI index file generation for sorted output.
--unmapped Specifies that unmapped records be included in the output.
--median-filter Picks one read per ZMW of median length.
--zmw Processes ZMW Reads; subreadset.xml input is required. This
activates the UNROLLED preset.
Examples:
Generate an index file for reference and reuse it to align reads:
pbmm2 index ref.fasta ref.mmi
pbmm2 align ref.mmi movie.subreads.bam ref.movie.bam
Align reads and sort on-the-fly, with 4 alignment and 2 sort threads:
pbmm2 align ref.fasta movie.subreads.bam ref.movie.bam --sort -j 4 -J 2
--hqregion Processes the HQ region of each ZMW; subreadset.xml input is required. This activates the UNROLLED preset.
Parameter set options and overrides Description
--preset Specifies the alignment mode:
"SUBREAD" -k 19 -w 10 -o 5 -O 56 -e 4 -E 1 -A 2 -B 5 -z 400 -Z 50 -r 2000 -L 0.5
"CCS" -k 19 -w 10 -u -o 5 -O 56 -e 4 -E 1 -A 2 -B 5 -z 400 -Z 50 -r 2000 -L 0.5
"ISOSEQ" -k 15 -w 5 -u -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -C 5 -r 200000 -G 200000 -L 0.5
"UNROLLED" -k 15 -w 15 -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 2000 -L 0.5
(Default = SUBREAD)
-k Specifies the k-mer size, which cannot be larger than 28. (Default = -1)
-w Specifies the Minimizer window size. (Default = -1)
-u,--no-kmer-compression Disables homopolymer-compressed k-mer. (Compression is on by default
for the SUBREAD and UNROLLED presets.)
-A Specifies the matching score. (Default = -1)
-B Specifies the mismatch penalty. (Default = -1)
-z Specifies the Z-drop score. (Default = -1)
-Z Specifies the Z-drop inversion score. (Default = -1)
-r Specifies the bandwidth used in chaining and DP-based alignment.
(Default = -1)
-o,--gap-open-1 Specifies the gap open penalty 1. (Default = -1)
-O,--gap-open-2 Specifies the gap open penalty 2. (Default = -1)
-e,--gap-extend-1 Specifies the gap extension penalty 1. (Default = -1)
-E,--gap-extend-2 Specifies the gap extension penalty 2. (Default = -1)
-L,--lj-min-ratio Specifies the long join flank ratio. (Default = -1)
-G Specifies the maximum intron length; this changes -r. (Default = -1)
-C Specifies the cost for a non-canonical GT-AG splicing. (Default = -1)
--no-splice-flank Specifies that you do not prefer splicing flanks GT-AG.
Align reads, sort on-the-fly, and create a PBI:
pbmm2 align ref.fasta movie.subreadset.xml ref.movie.alignmentset.xml --sort
Omit the output file and stream the BAM output to stdout:
pbmm2 align hg38.mmi movie1.subreadset.xml | samtools sort > hg38.movie1.sorted.bam
Align CCS reads from FASTQ input and sort the output:
pbmm2 align ref.fasta movie.Q20.fastq ref.movie.bam --preset CCS --sort --rg '@RG\tID:myid\tSM:mysample'
Alignment parallelization
The number of alignment threads can be specified using the -j,
--alignment-threads
option. If not specified, the maximum number of
threads are used, minus one thread for BAM I/O and minus the number of
threads specified for sorting.
Sorting
Sorted output can be generated using the --sort option. By default, 25% of the threads specified with the -j option (maximum = 8) are used for sorting. To override the default percentage, the -J,--sort-threads option defines the explicit number of threads used for on-the-fly sorting. The memory allocated per sort thread is defined using the -m,--sort-memory option, accepting the suffixes M and G.
Benchmarks on human data show that 4 sort threads are recommended, but that no more than 8 threads can be effectively leveraged, even with 70 cores used for alignment. We recommend that you provide more memory to each of a few sort threads to avoid disk I/O pressure, rather than providing less memory to each of many sort threads.
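For example, to follow that recommendation (the thread and memory values here are illustrative):
pbmm2 align ref.fasta movie.subreads.bam ref.movie.bam --sort -j 12 -J 4 -m 4G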
What are parameter sets and how can I override them?
Per default, pbmm2 uses recommended parameter sets to simplify the multitude of possible combinations. See the available parameter sets in the option table shown earlier.
What other special parameters are used implicitly?
We implicitly use the following minimap2 parameters:
• Soft clipping with -Y.
• Long cigars for tag CG with -L.
• X/= cigars instead of M with --eqx.
• No overlapping query intervals, with repeated matches trimming.
• No secondary alignments, using the --secondary=no option.
What is repeated matches trimming?
A repeated match occurs when the same query interval is shared between a primary and supplementary alignment. This can happen for translocations, where breakends share the same flanking sequence. And sometimes, when a LINE gets inserted, the flanks are duplicated, leading to complicated alignments where we see a split read sharing a duplication; the inserted region itself, mapping to a random other LINE in the reference genome, may also share sequence similarity to the flanks.
To get the best alignments, minimap2 decides that two alignments may use up to 50% (the default) of the same query bases. This does not work for PacBio, as pbmm2 requires that a single base may never be aligned twice. Minimap2 offers a feature to enforce a query interval overlap of 0%: if a query interval gets used in two alignments, one or both get flagged as secondary and get filtered. This leads to yield loss and, more importantly, missing SVs in the alignment.
Published dynamic programming approaches find the optimal split to uniquely map query intervals while maximizing alignment scores. We don't have per-base alignment scores available, so our approach is much simpler: we align the read, find overlapping query intervals, determine one alignment to be maximal reference-spanning, then trim all others. By trimming, pbmm2 rewrites the cigar and the reference coordinates on-the-fly. This allows us to increase the number of mapped bases, which slightly reduces mapped concordance, but boosts the SV recall rate.
How can I set the sample name?
You can override the sample name (SM field in the RG tag) for all read groups using the --sample option. If not provided, sample names derive from the Data Set input using the following order of precedence: A) SM field in the input read group, B) Biosample name, C) Well sample name, D) UnnamedSample. If the input is a BAM file and the --sample option was not used, the SM field is populated with UnnamedSample.
Can I split output by sample name?
Yes, the --split-by-sample option generates one output BAM file per
sample name, with the sample name as the file name prefix, if there is
more than one aligned sample name.
Can I remove all those extra per-base and per-pulse tags?
Yes, the --strip option removes the following extraneous tags if the input is BAM: dq, dt, ip, iq, mq, pa, pc, pd, pe, pg, pm, pq, pt, pv, pw, px, sf, sq, st. Note that the resulting output BAM file cannot be used as input into GenomicConsensus.
Where are the unmapped reads?
Per default, unmapped reads are omitted. You can add them to the output
BAM file using the
--unmapped option.
Can I output at maximum the N best alignments per read?
Use the -N,--best-n option. If set to 0 (the default), the maximum-alignments filter is disabled.
Is there a way to only align one subread per ZMW?
Using the --median-filter option, only the subread closest to the
median subread length per ZMW is aligned. Preferably, full-length
subreads flanked by adapters are chosen.
pbservice
The pbservice tool performs a variety of useful tasks within SMRT Link.
• To get help for pbservice, use pbservice -h.
• To get help for a specific pbservice command, use pbservice <command> -h.
Note: pbservice requires authentication when run from a remote host,
using the same credentials used to log in to the SMRT Link GUI. (This also
routes all requests through HTTPS port 8243, so the services port is not
required if authentication is used.) Access to services running on
localhost will continue to work without authentication.
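For example, to check system status on a remote server (the host name is illustrative):
pbservice status --host smrtlink.example.com --user myuser --ask-pass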
All
pbservice commands include the following optional parameters:
status Command: Use to get system status.
pbservice status [-h] [--host HOST] [--port PORT]
[--log-file LOG_FILE]
[--log-level INFO]
[--debug] [--quiet]
import-dataset Command: Import Local Data Set XML. The file location
must be accessible from the host where the services are running; often on
a shared file system.
pbservice import-dataset [-h] [--host HOST] [--port PORT]
[--log-file LOG_FILE]
[--log-level INFO]
[--debug] [--quiet]
xml_or_dir
Options Description
--host=http://localhost Specifies the server host. Override the default with the environment variable PB_SERVICE_HOST.
--port=8070 Specifies the server port. Override the default with the environment variable PB_SERVICE_PORT.
--log-file LOG_FILE Writes the log to file. (Default = None, writes to stdout.)
--log-level=INFO Specifies the log level; values are [DEBUG, INFO, WARNING, ERROR,
CRITICAL.] (Default = INFO)
--debug=False Alias for setting the log level to DEBUG. (Default = False)
--quiet=False Alias for setting the log level to CRITICAL to suppress output.
(Default = False)
--user USERNAME Specifies the user to authenticate as; this is required if the target host is
anything other than localhost.
--ask-pass Prompts the user to enter a password.
--password PASSWORD Supplies the password directly. This exposes the password in the shell
history (or log files), so this option is not recommended unless you are
using a limited account without Unix login access.
import-run Command: Create a SMRT Link Run Design and optional Auto Analysis jobs from a CSV file. This is equivalent to the Run Design CSV import feature in the SMRT Link UI; the resulting runs can then be edited in the UI or executed on-instrument.
pbservice import-run [-h][--host HOST] [--port PORT]
[--log-file LOG_FILE]
[--log-level INFO]
[--debug] [--quiet]
[--dry-run]
csv_file
import-fasta Command: Import a FASTA file and convert to a
ReferenceSet file. The file location must be accessible from the host
where the services are running; often on a shared file system.
pbservice import-fasta [-h] --name NAME --organism ORGANISM --ploidy PLOIDY
                       [--block] [--host HOST] [--port PORT]
[--log-file LOG_FILE]
[--log-level INFO]
[--debug] [--quiet]
fasta_path
run-analysis Command: Run a secondary analysis pipeline using an
analysis.json file.
pbservice run-analysis [-h] [--host HOST] [--port PORT]
                       [--log-file LOG_FILE]
                       [--log-level INFO]
                       [--debug] [--quiet] [--block]
                       json_path
Required Description
xml_or_dir Specifies a directory or XML file for the Data Set.
Required Description
csv_file Specifies a Run Design CSV-format file.
Option Description
--dry-run Prints the generated run XML without POSTing to the server.
Required Description
fasta_path Path to the FASTA file to import.
Options Description
--name Specifies the name of the ReferenceSet to convert the FASTA file to.
--organism Specifies the name of the organism.
--ploidy Specifies the ploidy of the organism.
--block=False Blocks until the import process completes.
emit-analysis-template Command: Output an analysis.json
template to
stdout that can be run using the run-analysis command.
pbservice emit-analysis-template [-h] [--log-file LOG_FILE]
[--log-level INFO]
[--debug] [--quiet]
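For example, generate a template, edit it, and then submit it as an analysis job:
pbservice emit-analysis-template > analysis.json
pbservice run-analysis --host smrtlink-release --port 9091 analysis.json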
get-job Command: Get a job summary by Job Id.
pbservice get-job [-h] [--host HOST] [--port PORT]
[--log-file LOG_FILE]
[--log-level INFO]
[--debug] [--quiet]
job_id
get-jobs Command: Get job summaries by Job Id.
pbservice get-jobs [-h] [-m MAX_ITEMS] [--host HOST] [--port PORT]
[--log-file LOG_FILE]
[--log-level INFO]
[--debug] [--quiet]
get-dataset Command: Get a Data Set summary by Data Set Id or UUID.
pbservice get-dataset [-h] [--host HOST] [--port PORT]
[--log-file LOG_FILE]
[--log-level INFO]
[--debug] [--quiet]
id_or_uuid
get-datasets Command: Get a Data Set list summary by Data Set type.
pbservice get-datasets [-h] [--host HOST] [--port PORT]
[--log-file LOG_FILE]
[--log-level INFO]
[--debug] [--quiet] [-m MAX_ITEMS] [-t DATASET_TYPE]
Required Description
json_path Path to the analysis.json file.
Options Description
--block=False Blocks until the analysis job completes.
Required Description
job_id Job id or UUID.
Options Description
-m=25, --max-items=25 Specifies the maximum number of jobs to get.
Required Description
id_or_uuid Data Set Id or UUID.
delete-dataset Command: Delete a specified Data Set.
Note: This is a "soft" delete - the database record is tagged as inactive so it
won't display in any lists, but the files will not be removed.
pbservice delete-dataset [-h] [--host HOST] [--port PORT]
[--log-file LOG_FILE]
[--log-level INFO]
[--debug] [--quiet]
[ID]
Examples
To obtain system status, the Data Set summary, and the job summary:
pbservice status --host smrtlink-release --port 9091
To import a Data Set XML:
pbservice import-dataset --host smrtlink-release --port 9091 \
path/to/subreadset.xml
To obtain a job summary using the Job Id:
pbservice get-job --host smrtlink-release --port 9091 \
--log-level CRITICAL 1
To obtain Data Sets by using the Data Set type subreads:
pbservice get-datasets --host smrtlink-alpha --port 8081 \
--quiet --max-items 1 -t subreads
To obtain Data Sets by using the Data Set type alignments:
pbservice get-datasets --host smrtlink-alpha --port 8081 \
--quiet --max-items 1 -t alignments
To obtain Data Sets by using the Data Set type references:
pbservice get-datasets --host smrtlink-alpha --port 8081 \
--quiet --max-items 1 -t references
To obtain Data Sets by using the Data Set type barcodes:
pbservice get-datasets --host smrtlink-alpha --port 8081 \
--quiet --max-items 1 -t barcodes
Required Description
-t=subreads, --dataset-type=subreads Specifies the type of Data Set to retrieve: subreads, alignments, references, barcodes.
Required Description
ID A valid Data Set ID, either UUID or integer ID, for the Data Set to delete.
To obtain Data Sets by using the Data Set UUID:
pbservice get-dataset --host smrtlink-alpha --port 8081 \
--quiet 43156b3a-3974-4ddb-2548-bb0ec95270ee
pbsv
pbsv is a structural variant caller for PacBio reads. It identifies structural variants and large indels (by default, 20 bp or longer) in a sample or set of samples relative to a reference. pbsv identifies the following types of variants: insertions, deletions, duplications, copy number variants, inversions, and translocations.
pbsv takes as input read alignments (BAM) and a reference genome (FASTA); it outputs structural variant calls (VCF).
Usage:
pbsv [-h] [--version] [--quiet] [--verbose]
{discover,call}...
pbsv discover Command: Finds structural variant (SV) signatures in read alignments. The input read alignments must be sorted by chromosome position. Alignments are typically generated with pbmm2. The output SVSIG file contains SV signatures.
Usage:
pbsv discover [options] <ref.in.bam|xml> <ref.out.svsig.gz>
Options Description
-h, --help Displays help information and exits.
--version Displays program version number and exits.
--log-file Logs to a file, instead of stdout.
--log-level Specifies the log level; values are [TRACE,DEBUG,INFO, WARN, FATAL.]
(Default = WARN)
discover Finds structural variant signatures in read alignments (BAM to SVSIG).
call Calls structural variants from SV signatures and assigns genotypes (SVSIG to VCF).
Required Description
ref.in.bam|xml Coordinate-sorted aligned reads in which to identify SV signatures.
ref.out.svsig.gz Structural variant signatures output.
Options Description
-h, --help Displays help information and exits.
-s,--sample Overrides sample name tag from BAM read group.
-q,--min-mapq Ignores alignments with mapping quality < N. (Default = 20)
-m,--min-ref-span Ignores alignments with reference length < N bp. (Default = 100)
-w,--downsample-window-length Specifies a window in which to limit coverage, in base pairs.
(Default = 10K)
-a,--downsample-max-alignments Considers up to N alignments in a window; 0 means disabled. (Default = 100)
pbsv call Command: Calls structural variants from SV signatures and assigns genotypes. The input SVSIG file is generated using pbsv discover. The output is structural variants in VCF format.
Usage:
pbsv call [options] <ref.fa|xml> <ref.in.svsig.gz|fofn> <ref.out.vcf>
pbsv discover options (continued):
-r,--region Limits discovery to this reference region: CHR|CHR:START-END.
-l,--min-svsig-length Ignores SV signatures with length < N bp. (Default = 7)
-b,--tandem-repeats Specifies tandem repeat intervals for indel clustering, as an input BED file.
-k,--max-skip-split Ignores alignment pairs separated by > N bp of a read or reference. (Default = 100)
-y,--min-gap-comp-id-perc Ignores alignments with gap-compressed sequence identity < N%. (Default = 70)
Required Description
ref.fa|xml Reference FASTA file or ReferenceSet XML file against which to call
variants.
ref.in.svsig.gz|fofn SV signatures from one or more samples. This can be either an SV
signature SVSIG file generated by pbsv discover, or a FOFN of SVSIG
files.
ref.out.vcf Variant call format (VCF) output file.
Options Description
-h, --help Displays help information and exits.
-j, --num-threads Specifies the number of threads to use; 0 means autodetection. (Default = 0)
-r, --region Limits discovery to this reference region: CHR|CHR:START-END.
-t, --types Calls these SV types: "DEL", "INS", "INV", "DUP", "BND", "CNV". (Default = "DEL,INS,INV,DUP,BND")
-m, --min-sv-length Ignores variants with length < N bp. (Default = 20)
--max-ins-length Ignores insertions with length > N bp. (Default = 15K)
--max-dup-length Ignores duplications with length > N bp. (Default = 1M)
--cluster-max-length-perc-diff Does not cluster signatures with difference in length > P%. (Default = 25)
--cluster-max-ref-pos-diff Does not cluster signatures > N bp apart in the reference. (Default = 200)
--cluster-min-basepair-perc-id Does not cluster signatures with base pair identity < P%. (Default = 10)
-x, --max-consensus-coverage Limits to N reads for variant consensus. (Default = 20)
-s, --poa-scores Scores POA alignment with triplet match,mismatch,gap. (Default = "1,-2,-2")
--min-realign-length Considers segments with > N length for realignment. (Default = 100)
-A, --call-min-reads-all-samples Ignores calls supported by < N reads total across samples. (Default = 3)
-O, --call-min-reads-one-sample Ignores calls supported by < N reads in every sample. (Default = 3)
-S, --call-min-reads-per-strand-all-samples Ignores calls supported by < N reads per strand total across samples. (Default = 1)
-B, --call-min-bnd-reads-all-samples Ignores BND calls supported by < N reads total across samples. (Default = 2)
-P, --call-min-read-perc-one-sample Ignores calls supported by < P% of reads in every sample. (Default = 20)
--preserve-non-acgt Preserves non-ACGT in REF allele instead of replacing with N. (Default = false)
--hifi, --ccs Uses options optimized for HiFi reads: -S 0 -P 10. (Default = false)
--gt-min-reads Specifies the minimum supporting reads to assign a sample a non-reference genotype. (Default = 1)
--annotations Annotates variants by comparing with sequences in FASTA. (Default annotations are ALU, L1, and SVA.)
--annotation-min-perc-sim Annotates variant if sequence similarity > P%. (Default = 60)
--min-N-in-gap Considers N consecutive "N" bp as a reference gap. (Default = 50)
--filter-near-reference-gap Flags variants < N bp from a gap as "NearReferenceGap". (Default = 1000)
--filter-near-contig-end Flags variants < N bp from a contig end as "NearContigEnd". (Default = 1K)
Following is a typical SV analysis workflow starting from subreads:
1. Align PacBio reads to a reference genome, per movie:
Subreads BAM input:
pbmm2 align ref.fa movie1.subreads.bam ref.movie1.bam --sort --median-filter --sample sample1
CCS reads BAM input:
pbmm2 align ref.fa movie1.ccs.bam ref.movie1.bam --sort --preset CCS --sample sample1
CCS reads FASTQ input:
pbmm2 align ref.fa movie1.Q20.fastq ref.movie1.bam --sort --preset CCS --sample sample1 --rg '@RG\tID:movie1'
2. Discover the signatures of structural variation, per movie or per sample:
pbsv discover ref.movie1.bam ref.sample1.svsig.gz
pbsv discover ref.movie2.bam ref.sample2.svsig.gz
3. Call structural variants and assign genotypes (all samples); for CCS analysis input, append --ccs:
pbsv call ref.fa ref.sample1.svsig.gz ref.sample2.svsig.gz ref.var.vcf
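When many movies or samples contribute signatures, the SVSIG files can also be supplied as a FOFN (file of file names). A minimal sketch for HiFi input:
ls ref.*.svsig.gz > svsigs.fofn
pbsv call --hifi ref.fa svsigs.fofn ref.var.vcf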
Launching a multi-sample pbsv analysis - requirements
1. Merge multiple Bio Sample SMRT Cells to one Data Set with the Bio Samples specified.
•Each SMRT Cell must have exactly one Bio Sample name - multiple Bio Sample names cannot be assigned to one SMRT Cell.
•Multiple SMRT Cells can have the same Bio Sample name.
•All of the inputs need to already have the appropriate Bio Sample records in their CollectionMetadata. If they don't, they are treated as a single sample.
2. Create a ReferenceSet from a FASTA file.
The ReferenceSet is often already generated and registered in SMRT Link. If the ReferenceSet doesn't exist, use the dataset create command to create one:
dataset create --type ReferenceSet --name reference_name reference.fasta
Launching a multi-sample analysis
# Set subreads and ref FASTA
sample1=sample1.subreadset.xml
sample2=sample2.subreadset.xml
ref=reference.fasta
pbmm2 align ${ref} ${sample1} sample1.bam --sort --median-filter --sample Sample1
pbmm2 align ${ref} ${sample2} sample2.bam --sort --median-filter --sample Sample2
samtools index sample1.bam
samtools index sample2.bam
pbindex sample1.bam
pbindex sample2.bam
pbsv discover sample1.bam sample1.svsig.gz
pbsv discover sample2.bam sample2.svsig.gz
pbsv call ${ref} sample1.svsig.gz sample2.svsig.gz out.vcf
out.vcf: A pbsv VCF output file, where columns starting from column 10
represent structural variants of Sample 1 and Sample 2:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2
chr01 222737 pbsv.INS.1 T TTGGTGTTTGTTGTTTTGTTTT . PASS
SVTYPE=INS;END=222737;SVLEN=21;SVANN=TANDEM GT:AD:DP 0/1:6,4:10 0/1:6,5:11
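To tabulate per-sample genotypes from the multi-sample VCF, the bundled bcftools (see "Appendix B - Third party command-line tools" on page 113) can be used. A minimal sketch:
bcftools query -f '%CHROM\t%POS\t%INFO/SVTYPE[\t%GT]\n' out.vcf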
pbvalidate
The pbvalidate tool validates that files produced by PacBio software are
compliant with PacBio’s own internal specifications.
Input files
pbvalidate supports the following input formats:
•BAM
•FASTA
•Data Set XML
See the PacBio file format specifications for further information about each format's requirements.
Usage
pbvalidate [-h] [--version] [--log-file LOG_FILE]
[--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL} | --debug | --quiet | -v]
[-c] [--quick] [--max MAX_ERRORS]
[--max-records MAX_RECORDS]
[--type {BAM,Fasta,AlignmentSet,ConsensusSet,ConsensusAlignmentSet,
SubreadSet,BarcodeSet,ContigSet,ReferenceSet,HdfSubreadSet}]
[--index] [--strict] [-x XUNIT_OUT] [--unaligned]
[--unmapped] [--aligned] [--mapped]
[--contents {SUBREAD,CCS}] [--reference REFERENCE]
file
Required Description
file Input BAM, FASTA, or Data Set XML file to validate.
Options Description
-h, --help Displays help information and exits.
--version Displays program version number and exits.
--log-file LOG_FILE Writes the log to file. Default (None) will write to stdout.
--log-level Specifies the log level; values are [DEBUG, INFO, WARNING, ERROR,
CRITICAL.] (Default = CRITICAL)
--debug Alias for setting the log level to DEBUG. (Default = False)
--quiet Alias for setting the log level to CRITICAL to suppress output. (Default = False)
--verbose, -v Sets the verbosity level. (Default = None)
--quick Limits validation to the first 100 records (plus file header); equivalent to --max-records=100. (Default = False)
--max MAX_ERRORS Exits after MAX_ERRORS errors have been recorded. (Default = None; checks the entire file.)
--max-records MAX_RECORDS Exits after MAX_RECORDS records have been inspected. (Default = None; checks the entire file.)
--type Uses the specified file type instead of guessing: [BAM, Fasta, AlignmentSet, ConsensusSet, ConsensusAlignmentSet, SubreadSet, BarcodeSet, ContigSet, ReferenceSet, HdfSubreadSet] (Default = None)
--index Requires index files: .fai or .pbi. (Default = False)
--strict Turns on additional validation, primarily for Data Set XML. (Default = False)
BAM options Description
--unaligned Specifies that the file should contain only unmapped alignments. (Default = None, no requirement.)
--unmapped Alias for --unaligned. (Default = None)
--aligned Specifies that the file should contain only mapped alignments. (Default = None, no requirement.)
--mapped Alias for --aligned. (Default = None)
--contents Enforces the read type: [SUBREAD, CCS] (Default = None)
--reference REFERENCE Specifies the path to an optional reference FASTA file, used for additional validation of mapped BAM records. (Default = None)
Examples
To validate a BAM file:
pbvalidate in.subreads.bam
To validate a FASTA file:
pbvalidate in.fasta
To validate a Data Set XML file:
pbvalidate in.subreadset.xml
To validate a BAM file and its index file (.pbi):
pbvalidate --index in.subreads.bam
To validate a BAM file and exit after 10 errors are detected:
pbvalidate --max 10 in.subreads.bam
To validate up to 100 records in a BAM file:
pbvalidate --max-records 100 in.subreads.bam
To validate up to 100 records in a BAM file (equivalent to --max-records=100):
pbvalidate --quick in.subreads.bam
To validate a BAM file, using a specified log level:
pbvalidate --log-level=INFO in.subreads.bam
To validate a BAM file and write log messages to a file rather than to
stdout:
pbvalidate --log-file validation_results.log in.subreads.bam
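To validate a mapped BAM file against the reference it was aligned to, the --aligned and --reference options described above can be combined (file names are illustrative):
pbvalidate --aligned --reference ref.fasta mapped.subreads.bam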
primrose
The primrose tool analyzes the kinetic signatures of cytosine bases in
CpG motifs to identify the presence of 5mC.
primrose uses a convolutional neural network (CNN) to predict the methylation state (5mC) of each CpG in a HiFi read. Methylation is assumed to be symmetric between strands, with output reported in the forward direction with respect to the HiFi read sequence. The output uses the Mm and Ml tags as defined in the SAM Format Optional Fields Specification.
Algorithm
The CNN is trained using in vitro modified controls of methylated and unmethylated human DNA. The unmethylated control comprises a human shotgun library that has undergone whole genome amplification (WGA), a process that includes PCR and therefore removes all base modifications. The methylated control is generated by subjecting a human WGS library to in vitro methylation using a CpG methyltransferase enzyme (M. SssI).
Kinetic data (pulse width and inter-pulse distance) over a 16-base-pair window, for both the forward and reverse strands, is used as input to the CNN. The output of the CNN is a probability that the CpG is symmetrically 5mC-modified.
Usage
primrose [options] <HiFi INPUT> <HiFi OUTPUT>
Example
primrose --log-level INFO movie.hifi.bam movie.hifi_5mc.bam
Required Description
HiFi INPUT Input BAM file or ConsensusReadSet XML file.
HiFi OUTPUT Output BAM file or ConsensusReadSet XML file.
Options Description
--keep-kinetics Specifies that the kinetics tracks (IPD and PulseWidth records) fi, fp, fn,
ri, rp and rn are included in the output BAM file.
--min-passes Specifies the minimum number of passes. (Default = 3)
--model Specifies a path to a trained TensorFlow model directory, or to an
exported ONNX model file.
-h, --help Displays help information and exits.
--version Displays program version number and exits.
-j, --num-threads Specifies the number of threads to use when processing; 0 means autodetection. (Default = 0)
--log-file LOG_FILE Writes the log to file. Default (None) will write to stderr.
--log-level Specifies the log level; values are [TRACE, DEBUG, INFO, WARNING, FATAL]. (Default = WARNING)
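For example, to keep the kinetics tracks in the output BAM file while requiring at least three passes (a sketch using the options documented above):
primrose --keep-kinetics --min-passes 3 movie.hifi.bam movie.hifi_5mc.bam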
runqc-reports
The runqc-reports tool generates up to five different Run QC reports, depending on Data Set type: Raw Data, Adapters, Loading, Control, and CCS reads. Generating the complete set of reports requires the presence of an sts.xml resource in the Data Set, but the CCS Analysis report (or a fallback Subreads report) is always generated. All report JSON and plot PNG files are written to the current working directory, unless otherwise specified.
Usage
runqc-reports [-h] [--version] [--log-file LOG_FILE]
[--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL} | --debug | --quiet | -v]
[-o OUTPUT_DIR]
dataset_xml
Examples
runqc-reports moviename.subreadset.xml
runqc-reports moviename.consensusreadset.xml
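To write the reports to a dedicated directory instead of the current working directory (the directory name is illustrative):
runqc-reports -o qc_reports moviename.consensusreadset.xml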
Required Description
dataset_xml Input SubreadSet or ConsensusReadSet XML, which must contain an sts.xml resource for the full Run QC report set to be generated.
Options Description
-o OUTPUT_DIR Output report directory. (Default = Current working directory)
-h, --help Displays help information and exits.
--version Displays program version number and exits.
--log-file LOG_FILE Writes the log to file. Default (None) will write to stdout.
--log-level Specifies the log level; values are [DEBUG, INFO, WARNING, ERROR,
CRITICAL.] (Default = WARNING)
--debug Alias for setting the log level to DEBUG. (Default = False)
--quiet Alias for setting the log level to CRITICAL to suppress output.
(Default = False)
--verbose, -v Sets the verbosity level. (Default = None)
summarizeModifications
The summarizeModifications tool generates a GFF summary file (alignment_summary.gff) from the output of base modification analysis (for example, ipdSummary) combined with the coverage summary GFF generated by resequencing pipelines. This is useful for power users running custom workflows.
Usage
summarizeModifications [-h] [--version]
[--log-file LOG_FILE]
[--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL} | --debug
| --quiet | -v]
modifications alignmentSummary gff_out
Input files
modifications: Base Modification GFF file.
alignmentSummary: Alignment Summary GFF file.
Output files
gff_out: Coverage summary for regions (bins) spanning the reference, with Base Modification results for each region.
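A typical invocation (file names are illustrative):
summarizeModifications basemods.gff alignment_summary.gff alignment_summary_with_basemods.gff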
Options Description
-h, --help Displays help information and exits.
--version Displays program version number and exits.
--log-file LOG_FILE Writes the log to file. Default (None) will write to stdout.
--log-level Specifies the log level; values are [DEBUG, INFO, WARNING, ERROR,
CRITICAL] (Default = INFO)
--debug Alias for setting the log level to DEBUG. (Default = False)
--quiet Alias for setting the log level to CRITICAL to suppress output.
(Default = False)
--verbose, -v Sets the verbosity level. (Default = None)
Appendix A - Application entry points and output files
Note: To print information about a specific PacBio workflow, including input files and task options, use the pbcromwell show-workflow-details command, which is available for all applications. Examples:
pbcromwell show-workflow-details pb_hgap4
pbcromwell show-workflow-details cromwell.workflows.pb_hgap4
(The prefix cromwell.workflows is optional.)
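For example, to inspect the Structural Variant Calling workflow listed later in this appendix:
pbcromwell show-workflow-details pb_sv_ccs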
CCS Analysis
Analysis application name: cromwell.workflows.pb_ccs
Entry point
:id: eid_subread
:name: Entry eid_subread
:fileTypeId: PacBio.DataSet.SubreadSet
Key output files
File name Datastore SourceId
FASTQ file ccs_fastq_out
FASTA file ccs_fasta_out
<moviename>.hifi.reads.bam file ccs_bam_out
<moviename>.hifi.reads.fasta ccs_fasta_out
<moviename>.hifi.reads.fastq ccs_fastq_out
Consensus Sequences pb_ccs.ccsxml
CCS Analysis Statistics pb_ccs.report_ccs
All Reads (BAM) reads_bam
Demultiplex Barcodes
Analysis application name: cromwell.workflows.pb_demux_ccs
Entry points
:id: eid_ccs
:name: Entry eid_ccs
:fileTypeId: PacBio.DataSet.ConsensusReadSet
:id: eid_barcode
:name: Entry eid_barcode
:fileTypeId: PacBio.DataSet.BarcodeSet
Key output files
File name Datastore SourceId
Barcode Report Details pb_demux_ccs.summary_csv
Demultiplexed Datasets pb_demux_ccs.demuxed_files_datastore
Unbarcoded Reads pb_demux_ccs.unbarcoded
Export Reads
Analysis application name: cromwell.workflows.pb_export_ccs
Entry point
:id: eid_ccs
:name: Entry eid_ccs
:fileTypeId: PacBio.DataSet.ConsensusReadSet
Key output files
File name Datastore SourceId
<moviename>.hifi_reads.fastq.gz ccs_fastq_out
<moviename>.hifi_reads.fasta.gz ccs_fasta_out
<moviename>.hifi_reads.bam.gz ccs_bam_out
Note: If users select a lower cutoff Phred QV value, the string hifi is replaced by the QV value in the file names. Example: <moviename>.q10.fastq.gz.
Genome Assembly
Analysis application name: cromwell.workflows.pb_assembly_hifi
Entry point
:id: eid_ccs
:name: Entry eid_ccs
:fileTypeId: PacBio.DataSet.ConsensusReadSet
Key output files
File name Datastore SourceId
Final Polished Assembly, Primary Contigs pb_assembly_hifi.final_primary_contigs_fasta
Final Polished Assembly, Haplotigs pb_assembly_hifi.final_haplotigs_fasta
List of Circular Contigs pb_assembly_hifi.circular_contigs
Summary Report pb_assembly_hifi.report_polished_assembly
HiFi Mapping
Analysis application name: cromwell.workflows.pb_ccs_subreads
Entry points
:id: eid_ccs
:name: Entry eid_ccs
:fileTypeId: PacBio.DataSet.ConsensusReadSet
:id: eid_ref_dataset
:name: Entry eid_ref_dataset
:fileTypeId: PacBio.DataSet.ReferenceSet
Key output files
File name Datastore SourceId
Mapped reads pb_align_ccs.mapped
Coverage summary pb_align_ccs.coverage_gff
HiFiViral SARS-CoV-2 Analysis
Analysis application name: cromwell.workflows.pb_sars_cov2_kit
Entry points
:id: eid_ccs (HiFi reads, demultiplexed as separate BAM files)
:name: Entry eid_ccs
:fileTypeId: PacBio.DataSet.ConsensusReadSet
:id: eid_ref_dataset_2 (Reference Genome)
:name: Entry eid_ref_dataset
:fileTypeId: PacBio.DataSet.ReferenceSet
Key output files
File name Datastore SourceId
All Samples, Probe Counts TSV pb_sars_cov2_kit.probe_counts_zip
All Samples, Raw Variant Calls VCF pb_sars_cov2_kit.raw_vcf_zip
All Samples, Variant Call VCF pb_sars_cov2_kit.vcf_zip
All Samples, Variant Calls CSV pb_sars_cov2_kit.variants_csv
All Samples, Consensus Sequence FASTA pb_sars_cov2_kit.fasta_zip
All Samples, Consensus Sequence By Fragments FASTA pb_sars_cov2_kit.frag_fasta_zip
All Samples, Aligned Reads BAM pb_sars_cov2_kit.mapped_zip
All Samples, Consensus Sequence Aligned BAM pb_sars_cov2_kit.aligned_frag_zip
All Samples, Trimmed HiFi reads FASTQ pb_sars_cov2_kit.trimmed_zip
Failed Sample Info pb_sars_cov2_kit.sample_failures_csv
Failed Sample Analysis Logs pb_sars_cov2_kit.errors_zip
Demultiplex Summaries pb_sars_cov2_kit.lima_summary_zip
Sample Summary Table CSV pb_sars_cov2_kit.summary_csv
All Samples, Genome Coverage Plots pb_sars_cov2_kit.coverage_png_zip
Iso-Seq Analysis
Analysis application name: cromwell.workflows.pb_isoseq3_ccsonly
Entry points
:id: eid_ccs
:name: Reads
:fileTypeId: PacBio.DataSet.ConsensusReadSet
:id: eid_barcode
:name: Primers
:fileTypeId: PacBio.DataSet.BarcodeSet
:id: eid_ref_dataset
:name: Reference (Optional)
:fileTypeId: PacBio.DataSet.ReferenceSet
Key output files
File name Datastore SourceId
Collapsed Filtered Isoforms FASTQ pb_isoseq3_ccsonly.collapse_fastq
Collapsed Filtered Isoforms GFF pb_isoseq3_ccsonly.collapse_gff
Group TXT pb_isoseq3_ccsonly.collapse_group
Abundance TXT pb_isoseq3_ccsonly.collapse_abundance
Read Stat TXT pb_isoseq3_ccsonly.collapse_readstat
Collapsed Isoforms Abundance TXT (Files are numbered consecutively, 1 for each barcoded sample.) pb_isoseq3_ccsonly.collapse_abundance (Separate clusters, one per barcoded sample.)
Collapsed Isoforms Abundance TXT (Files are numbered consecutively, 1 for each barcoded sample.) pb_isoseq3_ccsonly.collapse_abundance (Pooled clusters, one per barcoded sample.)
High-Quality Transcripts pb_isoseq3_ccsonly.hq_fastq
Low-Quality Transcripts pb_isoseq3_ccsonly.lq_fastq
High-Quality Transcripts Counts (Files are numbered consecutively, 1 for each barcoded sample.) pb_isoseq3_ccsonly.barcode_overview_report (Separate clusters, one per barcoded sample.)
High-Quality Transcripts Counts (Files are numbered consecutively, 1 for each barcoded sample.) pb_isoseq3_ccsonly.barcode_overview_report (Pooled clusters, one per barcoded sample.)
CCS reads FASTQ pb_isoseq3_ccsonly.ccs_fastq_zip
Full-length CCS reads pb_isoseq3_ccsonly.flnc_bam
Polished Report pb_isoseq3_ccsonly.polish_report_csv
Cluster Report pb_isoseq3_ccsonly.report_isoseq
Mark PCR Duplicates
Analysis application name: cromwell.workflows.pb_mark_duplicates
Entry points
:id: eid_ccs
:name: Entry eid_ccs
:fileTypeId: PacBio.DataSet.ConsensusReadSet
Key output files
File name Datastore SourceId
PCR Duplicates pb_mark_duplicates.duplicates
Deduplicated reads pb_mark_duplicates.deduplicated (In the SMRT Link UI, this displays as <ORIGINAL_DATASET_NAME> (deduplicated).)
Microbial Genome Analysis
Analysis application name: cromwell.workflows.pb_microbial_analysis
Entry point
:id: eid_ccs
:name: Entry eid_ccs
:fileTypeId: PacBio.DataSet.ConsensusReadSet
Key output files
File name Datastore SourceId
Final Polished Assembly pb_microbial_analysis.assembly_fasta
Final Polished Assembly Index pb_microbial_analysis.assembly_fasta.fai
Final Polished Assembly for NCBI pb_microbial_analysis.ncbi_fasta
Mapped BAM pb_microbial_analysis.mapped
Target sequences used to produce the Mapped BAM pb_microbial_analysis.mapped_target_fasta
Target sequence Index used to produce the Mapped BAM pb_microbial_analysis.mapped_target_fasta_fai
List of Circular Contigs pb_microbial_analysis.circular_list
Coverage Summary pb_microbial_analysis.coverage_gff
Coverage Report pb_microbial_analysis.report_coverage
Mapping Statistics Report pb_microbial_analysis.report_mapping_stats
Mapped BAM Datastore pb_microbial_analysis.mapped_bam_datastore
Polished Assembly Report pb_microbial_analysis.report_polished_assembly
Full Kinetics Summary pb_microbial_analysis.basemods_gff
IPD Ratios pb_microbial_analysis.basemods_csv
Motifs and Modifications pb_microbial_analysis.motifs_csv
Motifs and Modifications pb_microbial_analysis.motifs_gff
Minor Variants Analysis
Analysis application name: cromwell.workflows.pb_mv_ccs
Entry points
:id: eid_ccs
:name: Entry eid_ccs
:fileTypeId: PacBio.DataSet.ConsensusReadSet
:id: eid_ref_dataset
:name: Entry eid_ref_dataset
:fileTypeId: PacBio.DataSet.ReferenceSet
Key output files
File name Datastore SourceId
Minor Variants HTML Reports pb_mv_ccs.juliet_html
Per-Variant Table pb_mv_ccs.report_csv
Alignments pb_mv_ccs.mapped
Structural Variant Calling
Analysis application name: cromwell.workflows.pb_sv_ccs
Entry points
:id: eid_ccs
:name: Entry eid_ccs
:fileTypeId: PacBio.DataSet.ConsensusReadSet
:id: eid_ref_dataset
:name: Entry eid_ref_dataset
:fileTypeId: PacBio.DataSet.ReferenceSet
Key output files
File name Datastore SourceId
Structural Variants pb_sv_ccs.variants
Aligned reads (Bio Sample Name) pb_sv_ccs.alignments_by_sample_datastore
Trim Ultra-Low Adapters
Analysis application name: cromwell.workflows.pb_trim_adapters
Entry points
:id: eid_ccs
:name: Entry eid_ccs
:fileTypeId: PacBio.DataSet.ConsensusReadSet
:id: eid_barcode
:name: Entry eid_barcode
:fileTypeId: PacBio.DataSet.BarcodeSet
Note: The barcodes need to be a single primer sequence.
Key output files
File name Datastore SourceId
Reads Missing Adapters pb_trim_adapters.unbarcoded
PCR Adapter Data CSV pb_trim_adapters.summary_csv
Trimmed reads pb_trim_adapters.trimmed (In the SMRT Link UI, this displays as <ORIGINAL_DATASET_NAME> (trimmed).)
5mC CpG Detection
Analysis application name: cromwell.workflows.pb_detect_methyl
Entry points
:id: eid_ccs
:name: Entry eid_ccs
:fileTypeId: PacBio.DataSet.ConsensusReadSet
:id: eid_barcode
:name: Entry eid_barcode
:fileTypeId: PacBio.DataSet.BarcodeSet
Key output files
File name Datastore SourceId
HiFi reads with 5mC Calls pb_detect_methyl.ccsbam
Appendix B - Third party command-line tools
Following is information on the third-party command-line tools included in the smrtcmds/bin subdirectory.
bamtools
A C++ API and toolkit for reading, writing, and manipulating BAM files.
•See https://sourceforge.net/projects/bamtools/ for details.
cromwell
Scientific workflow engine used to power SMRT Link.
•See https://cromwell.readthedocs.io/en/stable/ for details.
ipython
An interactive shell for using the PacBio API.
•See https://ipython.org/ for details.
purge_dups
Removes haplotigs and contig overlaps in a de novo assembly based
on read depth.
•See https://github.com/dfguan/purge_dups for details.
python
An object-oriented programming language.
•See https://www.python.org/ for details.
samtools, BCFtools
A set of programs for interacting with high-throughput sequencing
data in SAM/BAM/CRAM/VCF/BCF2 formats.
•See http://www.htslib.org/ for details.
Appendix C - Microbial Genome Analysis advanced options (HiFi reads only)
Use this application to generate de novo assemblies of small prokaryotic genomes between 1.9 and 10 Mb, and companion plasmids between 2 and 220 kb, from HiFi (CCS) reads.
The workflow also performs base modification analysis of the assembled genome.
The Microbial Genome Analysis application:
•Includes chromosomal- and plasmid-level de novo genome assembly, circularization, polishing, and rotation of the origin of replication for each circular contig.
•Performs base modification detection (m4C and m6A).
•Facilitates assembly of larger genomes (yeast) as well.
•Accepts HiFi read data (BAM format) as input.
The workflow consists of two assembly stages: chromosomal and plasmid.
Chromosomal stage: Intended for contig assembly of large sequences. This stage uses more stringent filtering (via advanced options) to produce contiguous assemblies of complex regions, but it may miss small sequences in the input sample (such as plasmids).
Plasmid stage: Intended for a fine-grained assembly. This stage assembles only the unmapped and poorly mapped reads. It also relaxes the overlapping parameters, using advanced options.
Both stages use an automated random subsampling process to reduce the input Data Set for assembly (by default to 100x). Note that the subsampling is applied only to the contig construction process, while the alignment and report stages of the workflow still use the full input Data Set.
Available options for the two assembly stages are identical. The only differences are:
1. Chromosomal stage parameters are prefixed with ipa2_advanced_options_chrom. In the SMRT Link interface for the Microbial Genome Analysis application, this corresponds to the Advanced Assembly Options for chromosomal stage field.
2. Plasmid stage parameters are prefixed with ipa2_advanced_options_plasmid. In the SMRT Link interface for the Microbial Genome Analysis application, this corresponds to the Advanced Assembly Options for plasmid stage field.
The same sub-options are available to each stage, but the defaults are very different. The current defaults are:
ipa2_advanced_options_chrom = "config_block_size = 100; config_seeddb_opt = -k 28 -w 20 --space 0 --use-hpc-seeds-only; config_ovl_opt = --one-hit-per-target --min-idt 98 --traceback --mask-hp --mask-repeats --trim --trim-window-size 30 --trim-match-frac 0.75;"
ipa2_advanced_options_plasmid = "config_block_size = 100; config_ovl_filter_opt = --max-diff 80 --max-cov 100 --min-cov 2 --bestn 10 --min-len 500 --gapFilt --minDepth 4 --idt-stage2 98; config_ovl_min_len = 500; config_seeddb_opt = -k 28 -w 20 --space 0 --use-hpc-seeds-only; config_ovl_opt = --one-hit-per-target --min-idt 98 --min-map-len 500 --min-anchor-span 500 --traceback --mask-hp --mask-repeats --trim --trim-window-size 30 --trim-match-frac 0.75 --smart-hit-per-target --secondary-min-ovl-frac 0.05; config_layout_opt = --allow-circular;"
Options are separated by semicolons; within each option, parameters are
separated by spaces.
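As an illustration of the syntax only (the values shown are hypothetical, and modifying the defaults is not recommended), a stage's advanced options string might set individual sub-options as follows:
ipa2_advanced_options_chrom = "config_block_size = 200; config_seeddb_opt = -k 28 -w 20 --space 0 --use-hpc-seeds-only;"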
Users should not need to modify any of the default options. If the defaults are modified, workflow behavior could be very different.
Note: The options available in ipa2_advanced_options_* are exactly the same as the config_* options available for the Genome Assembly tool. See "Genome Assembly parameters input files" on page 43 for details.