CLI#

FinaleToolkit is a package and standalone program to extract fragmentation features of cell-free DNA from paired-end sequencing data.

usage: finaletoolkit [-h] [-v]
                     {coverage,frag-length-bins,frag-length-intervals,cleavage-profile,wps,adjust-wps,delfi,delfi-gc-correct,end-motifs,interval-end-motifs,mds,interval-mds,filter-bam,agg-bw,gap-bed}
                     ...

Named Arguments#

-v, --version: show program’s version number and exit

Sub-commands#

coverage#

Calculates fragmentation coverage over intervals defined in a BED file based on alignment data from a BAM/SAM/CRAM/Fragment file.

finaletoolkit coverage [-h] [-o OUTPUT_FILE] [-s SCALE_FACTOR] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file interval_file

Positional Arguments#

input_file: Path to a BAM/SAM/CRAM/Fragment file containing fragment data.
interval_file: Path to a BED file containing intervals to calculate coverage over.

Named Arguments#

-o, --output_file

A BED file containing coverage values over the intervals specified in interval file.

Default: “-”

-s, --scale-factor

Scale factor for coverage values.

Default: 1000000.0

-q, --quality_threshold

Minimum mapping quality threshold.

Default: 30

-w, --workers

Number of worker processes.

Default: 1

-v, --verbose

Enable verbose mode to display detailed processing information.

Default: False
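
As a rough illustration, the sketch below assembles a coverage invocation from Python using only the options documented above. The helper name build_coverage_cmd and the file names are hypothetical, and the command is only printed, not executed.

```python
import shlex


def build_coverage_cmd(input_file, interval_file, output_file="-",
                       scale_factor=1e6, quality_threshold=30, workers=1):
    """Return an argv list for `finaletoolkit coverage` (hypothetical helper)."""
    return [
        "finaletoolkit", "coverage",
        "-o", output_file,
        "-s", str(scale_factor),
        "-q", str(quality_threshold),
        "-w", str(workers),
        input_file, interval_file,
    ]


# File names are placeholders for illustration only.
cmd = build_coverage_cmd("sample.bam", "intervals.bed",
                         output_file="coverage.bed", workers=4)
print(shlex.join(cmd))
```

With FinaleToolkit installed, a list like this could be handed to subprocess.run(cmd, check=True); building it as a list rather than a string avoids shell quoting problems with unusual file names.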

frag-length-bins#

Retrieves fragment lengths grouped in bins given a BAM/SAM/CRAM/Fragment file.

finaletoolkit frag-length-bins [-h] [-c CONTIG] [-S START] [-p {midpoint,any}] [-E STOP] [--bin-size BIN_SIZE] [-o OUTPUT_FILE] [--contig-by-contig]
                               [--histogram] [-q QUALITY_THRESHOLD] [-v]
                               input_file

Positional Arguments#

input_file: Path to a BAM/SAM/CRAM/Fragment file containing fragment data.

Named Arguments#

-c, --contig

Specify the contig or chromosome to select fragments from. (Required if using --start or --stop.)

-S, --start

Specify the 0-based left-most coordinate of the interval to select fragments from. (Must also specify --contig.)

-p, --intersect_policy

Possible choices: midpoint, any

Specifies what policy is used to include fragments in the given interval. See User Guide for more information.

Default: “midpoint”

-E, --stop

Specify the 1-based right-most coordinate of the interval to select fragments from. (Must also specify --contig.)

--bin-size

Specify the size of the bins to group fragment lengths into.

-o, --output_file

A .TSV file containing fragment lengths binned according to the specified bin size.

Default: “-”

--contig-by-contig

Placeholder, not implemented.

Default: False

--histogram

Enable histogram mode to display histogram in terminal.

Default: False

-q, --quality_threshold

Minimum mapping quality threshold.

Default: 30

-v, --verbose

Enable verbose mode to display detailed processing information.

Default: 0
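
To make the idea of binning concrete, here is a minimal sketch of grouping fragment lengths into fixed-size bins. The function name and the choice to align bins to multiples of the bin size are assumptions for illustration; FinaleToolkit may place its bin edges differently.

```python
from collections import Counter


def bin_fragment_lengths(lengths, bin_size=5):
    """Count fragment lengths per bin; keys are the inclusive lower bin edge.

    Assumes bins are aligned to multiples of bin_size, which may differ
    from how FinaleToolkit places its bin edges.
    """
    counts = Counter((length // bin_size) * bin_size for length in lengths)
    return dict(sorted(counts.items()))


lengths = [142, 144, 147, 166, 167, 168, 169, 171]  # made-up example values
print(bin_fragment_lengths(lengths, bin_size=5))
```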

frag-length-intervals#

Retrieves fragment length summary statistics over intervals defined in a BED file based on alignment data from a BAM/SAM/CRAM/Fragment file.

finaletoolkit frag-length-intervals [-h] [-p {midpoint,any}] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file interval_file

Positional Arguments#

input_file: Path to a BAM/SAM/CRAM/Fragment file containing fragment data.
interval_file: Path to a BED file containing intervals to retrieve fragment length summary statistics over.

Named Arguments#

-p, --intersect_policy

Possible choices: midpoint, any

Specifies what policy is used to include fragments in the given interval. See User Guide for more information.

Default: “midpoint”

-o, --output-file

A BED file containing fragment length summary statistics (mean, median, st. dev, min, max) over the intervals specified in the interval file.

Default: “-”

-q, --quality-threshold

Minimum mapping quality threshold.

Default: 30

-w, --workers

Number of worker processes.

Default: 1

-v, --verbose

Enable verbose mode to display detailed processing information.

Default: 0
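
The two intersect policies can be pictured with the sketch below. The semantics shown (a fragment is kept under "midpoint" when its midpoint lies in the interval, and under "any" when it overlaps the interval at all) are an assumption made for illustration; the User Guide remains the authoritative description. Coordinates are treated as 0-based, half-open.

```python
def fragment_selected(frag_start, frag_stop, start, stop, policy="midpoint"):
    """Decide whether a fragment counts as inside the interval [start, stop).

    Assumed semantics: "midpoint" keeps a fragment whose midpoint lies in the
    interval; "any" keeps a fragment that overlaps the interval at all.
    """
    if policy == "midpoint":
        midpoint = (frag_start + frag_stop) // 2
        return start <= midpoint < stop
    if policy == "any":
        return frag_start < stop and frag_stop > start
    raise ValueError(f"unknown intersect policy: {policy!r}")


# A fragment that overlaps the interval but whose midpoint falls outside it.
print(fragment_selected(90, 210, 100, 150, policy="midpoint"))
print(fragment_selected(90, 210, 100, 150, policy="any"))
```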

cleavage-profile#

Calculates cleavage proportion over intervals defined in a BED file based on alignment data from a BAM/SAM/CRAM/Fragment file.

finaletoolkit cleavage-profile [-h] [-o OUTPUT_FILE] [-lo FRACTION_LOW] [-hi FRACTION_HIGH] [-q QUALITY_THRESHOLD] [-l LEFT] [-r RIGHT] [-w WORKERS]
                               [-v]
                               input_file interval_file

Positional Arguments#

input_file: Path to a BAM/SAM/CRAM/Fragment file containing fragment data.
interval_file: Path to a BED file containing intervals to calculate cleavage proportion over.

Named Arguments#

-o, --output_file

A bigWig file containing the cleavage proportion results over the intervals specified in interval file.

Default: “-”

-lo, --fraction_low

Minimum length for a fragment to be included in cleavage proportion calculation.

Default: 120

-hi, --fraction_high

Maximum length for a fragment to be included in cleavage proportion calculation.

Default: 180

-q, --quality-threshold

Minimum mapping quality threshold.

Default: 20

-l, --left

Number of base pairs to subtract from start coordinate to create interval. Useful when dealing with BED files with only CpG coordinates.

Default: 0

-r, --right

Number of base pairs to add to stop coordinate to create interval. Useful when dealing with BED files with only CpG coordinates.

Default: 0

-w, --workers

Number of worker processes.

Default: 1

-v, --verbose

Enable verbose mode to display detailed processing information.

Default: 0

wps#

Calculates Windowed Protection Score (WPS) over intervals defined in a BED file based on alignment data from a BAM/SAM/CRAM/Fragment file.

finaletoolkit wps [-h] [-o OUTPUT_FILE] [-i INTERVAL_SIZE] [-W WINDOW_SIZE] [-lo FRACTION_LOW] [-hi FRACTION_HIGH] [-q QUALITY_THRESHOLD]
                  [-w WORKERS] [-v]
                  input_file site_bed

Positional Arguments#

input_file: Path to a BAM/SAM/CRAM/Fragment file containing fragment data.
site_bed: Path to a BED file containing intervals to calculate WPS over.

Named Arguments#

-o, --output_file

A bigWig file containing the WPS results over the intervals specified in interval file.

Default: “-”

-i, --interval_size

Size in bp of each interval in the interval file.

Default: 5000

-W, --window_size

Size of the sliding window used to calculate WPS scores.

Default: 120

-lo, --fraction_low

Minimum length for a fragment to be included in WPS calculation.

Default: 120

-hi, --fraction_high

Maximum length for a fragment to be included in WPS calculation.

Default: 180

-q, --quality_threshold

Minimum mapping quality threshold.

Default: 30

-w, --workers

Number of worker processes.

Default: 1

-v, --verbose

Enable verbose mode to display detailed processing information.

Default: 0
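
For readers unfamiliar with WPS, the following sketch computes the score at a single position using the commonly cited definition: fragments that span the whole window add one, and fragments with an endpoint inside the window subtract one. It is a simplified illustration, not FinaleToolkit's implementation; the function name, the half-open coordinates, and the handling of window edges are assumptions.

```python
def wps_at(pos, fragments, window_size=120, fraction_low=120, fraction_high=180):
    """WPS at one position from (start, stop) fragments, 0-based half-open."""
    half = window_size // 2
    win_start, win_stop = pos - half, pos + half
    score = 0
    for frag_start, frag_stop in fragments:
        length = frag_stop - frag_start
        if not fraction_low <= length <= fraction_high:
            continue  # outside the accepted fragment length range
        if frag_start <= win_start and frag_stop >= win_stop:
            score += 1  # fragment spans the entire window
        elif win_start < frag_start < win_stop or win_start < frag_stop < win_stop:
            score -= 1  # fragment has an endpoint inside the window
    return score


# Toy fragments; the 300 bp fragment is ignored by the length filter.
fragments = [(0, 150), (100, 250), (40, 200), (0, 300)]
print(wps_at(100, fragments))
```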

adjust-wps#

Adjusts raw Windowed Protection Score (WPS) by applying a median filter and Savitzky-Golay filter.

finaletoolkit adjust-wps [-h] [-o OUTPUT_FILE] [-i INTERVAL_SIZE] [-m MEDIAN_WINDOW_SIZE] [-s SAVGOL_WINDOW_SIZE] [-p SAVGOL_POLY_DEG] [-w WORKERS]
                         [--mean] [--subtract-edges] [-v]
                         input_file interval_file genome_file

Positional Arguments#

input_file: A bigWig file containing the WPS results over the intervals specified in interval file.
interval_file: Path to a BED file containing the intervals that WPS was calculated over.
genome_file: A .chrom.sizes file containing chromosome sizes.

Named Arguments#

-o, --output-file

A bigWig file containing the adjusted WPS results over the intervals specified in interval file.

Default: “-”

-i, --interval_size

Size in bp of each interval in the interval file.

Default: 5000

-m, --median-window-size

Size of the median filter window used to adjust WPS scores.

Default: 1000

-s, --savgol-window-size

Size of the Savitzky-Golay filter window used to adjust WPS scores.

Default: 21

-p, --savgol-poly-deg

Polynomial degree for the Savitzky-Golay filter.

Default: 2

-w, --workers

Number of worker processes.

Default: 1

--mean

A mean filter is used instead of median.

Default: False

--subtract-edges

Take the median of the first and last 500 bases in a window and subtract from the whole interval.

Default: False

-v, --verbose

Enable verbose mode to display detailed processing information.
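
The edge subtraction described for --subtract-edges can be sketched in a few lines. This is an illustrative reading of the option text, with a hypothetical function name and a toy interval; it does not include the median or Savitzky-Golay filtering steps.

```python
from statistics import median


def subtract_edges(scores, edge_size=500):
    """Subtract the median of the first and last edge_size values from all values."""
    edges = list(scores[:edge_size]) + list(scores[-edge_size:])
    baseline = median(edges)
    return [score - baseline for score in scores]


raw = [2, 2, 3, 10, 12, 11, 3, 2, 2]  # toy interval, far shorter than 5000 bp
print(subtract_edges(raw, edge_size=3))
```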

delfi#

Calculates DELFI features over genome, returning information about (GC-corrected) short fragments, long fragments, DELFI ratio, and total fragments.

finaletoolkit delfi [-h] [-b BLACKLIST_FILE] [-g GAP_FILE] [-o OUTPUT_FILE] [-G] [-R] [-M] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v]
                    input_file autosomes reference_file bins_file

Positional Arguments#

input_file: Path to a BAM/SAM/CRAM/Fragment file containing fragment data.
autosomes: Tab-delimited file containing (1) autosome name and (2) integer length of chromosome in base pairs.
reference_file: The .2bit file for the associated reference genome sequence used during alignment.
bins_file: A BED file containing bins over which to calculate DELFI. To replicate Cristiano et al.’s methodology, use 100kb bins over human autosomes.

Named Arguments#

-b, --blacklist-file

BED file containing regions to ignore when calculating DELFI.

-g, --gap-file

BED4 format file containing columns for “chrom”, “start”, “stop”, and “type”. The “type” column should denote whether the entry corresponds to a “centromere”, “telomere”, or “short arm”; entries not falling into these categories are ignored. This information corresponds to the “gap” track for hg19 in the UCSC Genome Browser.

-o, --output-file

BED, bed.gz, TSV, or CSV file to write DELFI data to. If “-”, writes to stdout.

Default: “-”

-G, --no-gc-correct

Skip GC correction.

Default: True

-R, --keep-nocov

Skip removal of two regions in hg19 with no coverage. Use this flag when not using the hg19 human reference genome.

Default: True

-M, --no-merge-bins

Keep 100kb bins and do not merge to 5Mb size.

Default: True

-q, --quality-threshold

Minimum mapping quality threshold.

Default: 30

-w, --workers

Number of worker processes.

Default: 1

-v, --verbose

Enable verbose mode to display detailed processing information.

Default: 0
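
As a hedged illustration of the per-bin quantities named above, the sketch below counts short and long fragments in one bin and forms their ratio. The length ranges (100 to 150 bp for short, 151 to 220 bp for long) are assumptions taken from the DELFI publication rather than values read from FinaleToolkit, and GC correction and bin merging are left out entirely.

```python
def delfi_bin_features(fragment_lengths, short=(100, 150), long=(151, 220)):
    """Summarize one bin: short count, long count, total, and short/long ratio.

    The length ranges are assumptions taken from the DELFI paper, not values
    read from FinaleToolkit itself.
    """
    n_short = sum(short[0] <= n <= short[1] for n in fragment_lengths)
    n_long = sum(long[0] <= n <= long[1] for n in fragment_lengths)
    ratio = n_short / n_long if n_long else float("nan")
    return {"short": n_short, "long": n_long,
            "num_frags": n_short + n_long, "ratio": ratio}


# Made-up fragment lengths for a single bin; 90 and 300 fall outside both ranges.
print(delfi_bin_features([120, 140, 160, 170, 180, 200, 90, 300]))
```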

delfi-gc-correct#

Performs GC correction on raw DELFI data.

finaletoolkit delfi-gc-correct [-h] [-o OUTPUT_FILE] [--header-lines HEADER_LINES] [-v] input_file

Positional Arguments#

input_file: BED file containing raw DELFI data. Raw DELFI data should only have columns for “contig”, “start”, “stop”, “arm”, “short”, “long”, “gc”, “num_frags”, “ratio”.

Named Arguments#

-o, --output-file

BED file to write GC-corrected DELFI fractions to. If “-”, writes to stdout.

Default: “-”

--header-lines

Number of header lines in BED.

Default: 1

-v, --verbose

Enable verbose mode to display detailed processing information.

end-motifs#

Measures frequency of k-mer 5’ end motifs.

finaletoolkit end-motifs [-h] [-k K] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file refseq_file

Positional Arguments#

input_file: Path to a BAM/SAM/CRAM/Fragment file containing fragment data.
refseq_file: The .2bit file for the associated reference genome sequence used during alignment.

Named Arguments#

-k

Length of k-mer.

Default: 4

-o, --output-file

TSV file to write k-mer frequencies to. If “-”, writes to stdout.

Default: “-”

-q, --quality-threshold

Minimum mapping quality threshold.

Default: 20

-w, --workers

Number of worker processes.

Default: 1

-v, --verbose

Enable verbose mode to display detailed processing information.

Default: 0
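
The idea of a 5’ end motif can be illustrated with a short sketch that works on an in-memory sequence instead of a .2bit file. It assumes that both ends of each fragment are counted, with the reverse-strand end read as a reverse complement; the function names and the toy sequence are hypothetical and this is not FinaleToolkit's own code.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def end_motif_frequencies(reference, fragments, k=4):
    """Frequency of k-mers at the 5' ends of both strands of each fragment.

    reference: sequence of a single contig as an uppercase string.
    fragments: (start, stop) pairs, 0-based half-open.
    """
    counts = Counter()
    for start, stop in fragments:
        # 5' end of the forward strand
        counts[reference[start:start + k]] += 1
        # 5' end of the reverse strand
        counts[reverse_complement(reference[stop - k:stop])] += 1
    total = sum(counts.values())
    return {motif: n / total for motif, n in counts.items()}


print(end_motif_frequencies("ACGTTGCAAT", [(0, 8)], k=2))
```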

interval-end-motifs#

Measures frequency of k-mer 5’ end motifs in each region specified in a BED file and writes data into a table.

finaletoolkit interval-end-motifs [-h] [-k K] [-lo FRACTION_LOW] [-hi FRACTION_HIGH] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v]
                                  input_file refseq_file intervals

Positional Arguments#

input_file: Path to a BAM/SAM/CRAM/Fragment file containing fragment data.
refseq_file: The .2bit file for the associated reference genome sequence used during alignment.
intervals: Path to a BED file containing intervals to retrieve end motif frequencies over.

Named Arguments#

-k

Length of k-mer.

Default: 4

-lo, --fraction-low

Minimum length for a fragment to be included in end motif frequency.

Default: 10

-hi, --fraction-high

Maximum length for a fragment to be included in end motif frequency.

Default: 600

-o, --output-file

Path to TSV or CSV file to write end motif frequencies to.

Default: “-”

-q, --quality-threshold

Minimum mapping quality threshold.

Default: 20

-w, --workers

Number of worker processes.

Default: 1

-v, --verbose

Enable verbose mode to display detailed processing information.

Default: 0

mds#

Reads k-mer frequencies from a file and calculates a motif diversity score (MDS) using normalized Shannon entropy as described by Jiang et al. (2020).

finaletoolkit mds [-h] [-s SEP] [--header HEADER] [file_path]

Positional Arguments#

file_path

Tab-delimited or similar file containing one column for all k-mers and one column for frequency. Reads from stdin by default.

Default: “-”

Named Arguments#

-s, --sep

Separator used in tabular file.

Default: ” “

--header

Number of header rows to ignore.

Default: 0
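
Normalized Shannon entropy can be written out in a few lines. The sketch below takes a mapping of k-mers to frequencies and divides the entropy by the logarithm of the number of possible k-mers (4 to the power k), so the score runs from 0 (a single motif) to 1 (all motifs equally frequent). The function name is hypothetical, and the input is an in-memory dictionary rather than a file.

```python
import math


def motif_diversity_score(frequencies):
    """Normalized Shannon entropy of k-mer frequencies (0 = one motif, 1 = uniform)."""
    k = len(next(iter(frequencies)))      # infer k from the first motif
    total = sum(frequencies.values())     # renormalize in case inputs are counts
    entropy = 0.0
    for freq in frequencies.values():
        p = freq / total
        if p > 0:                          # skip zero terms; p*log(p) tends to 0
            entropy -= p * math.log(p)
    return entropy / math.log(4 ** k)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(motif_diversity_score(uniform))
```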

interval-mds#

Reads k-mer frequencies from a file and calculates a motif diversity score (MDS) for each interval using normalized Shannon entropy as described by Jiang et al. (2020).

finaletoolkit interval-mds [-h] [-s SEP] [file_path] file_out

Positional Arguments#

file_path

Tab-delimited or similar file containing one column for all k-mers and one column for frequency. Reads from stdin by default.

Default: “-”

file_out

Path to the output BED/BEDGraph file containing MDS for each interval.

Default: “-”

Named Arguments#

-s, --sep

Separator used in tabular file.

Default: ” “

filter-bam#

Filters a BAM file so that all reads are in mapped pairs, exceed a certain MAPQ, are not flagged for quality, are read1, are not secondary or supplementary alignments, and are on the same reference sequence as the mate.

finaletoolkit filter-bam [-h] [-r REGION_FILE] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-hi FRACTION_HIGH] [-lo FRACTION_LOW] [-w WORKERS] [-v]
                         input_file

Positional Arguments#

input_file: Path to BAM file.

Named Arguments#

-r, --region-file

Only alignments overlapping the intervals in this BED file will be included in the output.

-o, --output-file

Output BAM file path.

Default: “-”

-q, --quality_threshold

Minimum mapping quality threshold.

Default: 30

-hi, --fraction-high

Maximum length for a fragment to be included in output BAM.

-lo, --fraction-low

Minimum length for a fragment to be included in output BAM.

-w, --workers

Number of worker processes.

Default: 1

-v, --verbose

Enable verbose mode to display detailed processing information.
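
The filtering criteria listed for this sub-command map naturally onto SAM flag bits. The sketch below checks them on plain integers; the function name is hypothetical, the fragment length limits are omitted, and treating the MAPQ threshold as inclusive is an assumption.

```python
# SAM flag bits, as defined in the SAM specification.
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
READ1, SECONDARY, QC_FAIL, SUPPLEMENTARY = 0x40, 0x100, 0x200, 0x800


def keep_read(flag, mapq, ref_id, mate_ref_id, quality_threshold=30):
    """Return True if a read passes the filters described for filter-bam."""
    required = PAIRED | READ1
    excluded = UNMAPPED | MATE_UNMAPPED | SECONDARY | QC_FAIL | SUPPLEMENTARY
    return (
        (flag & required) == required      # paired and first in pair
        and (flag & excluded) == 0         # mapped, primary, passes QC
        and mapq >= quality_threshold      # assumed inclusive threshold
        and ref_id == mate_ref_id          # mate on the same reference sequence
    )


print(keep_read(0x1 | 0x2 | 0x40, mapq=60, ref_id=0, mate_ref_id=0))
```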

agg-bw#

Aggregates a bigWig signal over constant-length intervals defined in a BED file.

finaletoolkit agg-bw [-h] [-o OUTPUT_FILE] [-m MEDIAN_WINDOW_SIZE] [-v] input_file interval_file

Positional Arguments#

input_file: A bigWig file containing signals over the intervals specified in interval file.
interval_file: Path to a BED file containing the intervals over which signals were calculated.

Named Arguments#

-o, --output-file

A wiggle file containing the aggregate signal over the intervals specified in interval file.

Default: “-”

-m, --median-window-size

Size of the median filter window used to adjust WPS scores. Only modify if aggregating WPS signals.

Default: 0

-v, --verbose

Enable verbose mode to display detailed processing information.

gap-bed#

Creates a BED4 file containing centromeres, telomeres, and short-arm intervals, similar to the gaps annotation track for hg19 found on the UCSC Genome Browser (Kent et al., 2002). Currently only supports hg19, b37, human_g1k_v37, hg38, and GRCh38.

finaletoolkit gap-bed [-h] {hg19,b37,human_g1k_v37,hg38,GRCh38} output_file

Positional Arguments#

reference_genome

Possible choices: hg19, b37, human_g1k_v37, hg38, GRCh38

Reference genome to provide gaps for.

output_file

Path to write BED file to. If “-” used, writes to stdout.

The term “gap” is used liberally in this command and, in the case of hg38/GRCh38, may refer to regions where there are no longer gaps in the reference sequence.