CLI#

Calculates fragmentation features given a CRAM, BAM, SAM, or Frag.gz file.

usage: finaletoolkit [-h] {coverage,frag-length,frag-length-bins,frag-length-intervals,wps,delfi,filter-bam,adjust-wps,agg-bw,delfi-gc-correct,end-motifs,interval-end-motifs,mds,interval-mds,gap-bed,cleavage-profile} ...

subcommands#

subcommand

Possible choices: coverage, frag-length, frag-length-bins, frag-length-intervals, wps, delfi, filter-bam, adjust-wps, agg-bw, delfi-gc-correct, end-motifs, interval-end-motifs, mds, interval-mds, gap-bed, cleavage-profile

Sub-commands#

coverage#

Calculates fragmentation coverage over intervals in a BED file given a SAM, BAM, CRAM, or Frag.gz file

finaletoolkit coverage [-h] [-o OUTPUT_FILE] [-s SCALE_FACTOR] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file interval_file

Positional Arguments#

input_file

SAM, BAM, CRAM, or Frag.gz file containing fragment data

interval_file

BED file containing intervals over which coverage is calculated

Named Arguments#

-o, --output_file

BED file where coverage is printed

Default: “-”

-s, --scale-factor

Amount coverage will be multiplied by

Default: 1000000.0

-q, --quality_threshold

Default: 30

-w, --workers

Number of worker processes to use. Default is 1.

Default: 1

-v, --verbose

Default: 0
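
As a hedged illustration of how the options above combine, the following Python sketch assembles a coverage invocation as an argument list. The file names (sample.bam, intervals.bed, coverage.bed) and the helper name build_coverage_command are hypothetical; only the flags and their defaults come from the usage line above.

```python
import shlex

def build_coverage_command(input_file, interval_file, output_file="-",
                           scale_factor=1e6, quality_threshold=30, workers=1):
    """Assemble an argv list for `finaletoolkit coverage` from the documented flags."""
    return [
        "finaletoolkit", "coverage",
        "-o", output_file,
        "-s", str(scale_factor),
        "-q", str(quality_threshold),
        "-w", str(workers),
        input_file, interval_file,
    ]

# Hypothetical file names, for illustration only.
argv = build_coverage_command("sample.bam", "intervals.bed",
                              output_file="coverage.bed", workers=4)
command_line = shlex.join(argv)
print(command_line)
```

With finaletoolkit installed, the list could be passed to subprocess.run(argv, check=True); building a list rather than a shell string avoids quoting problems with unusual file names.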

frag-length#

Calculates fragment lengths given a CRAM/BAM/SAM file

finaletoolkit frag-length [-h] [-c CONTIG] [-S START] [-E STOP] [-p INTERSECT_POLICY] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-v] input_file

Positional Arguments#

input_file

BAM or frag.gz file containing fragment data.

Named Arguments#

-c, --contig

Contig or chromosome to select fragments from. Required if using --start or --stop.

-S, --start

0-based left-most coordinate of interval to select fragments from. Must also use --contig.

-E, --stop

1-based right-most coordinate of interval to select fragments from. Must also use --contig.

-p, --intersect_policy

Specifies what policy is used to include fragments in the given interval. Default is “midpoint”. Policies include:

- midpoint: the average of end coordinates of a fragment lies in the interval.
- any: any part of the fragment is in the interval.

Default: “midpoint”

-o, --output_file

File to write results to. “-” may be used to write to stdout. Default is “-”.

Default: “-”

-q, --quality_threshold

Minimum MAPQ. Default is 30.

Default: 30

-v, --verbose

Verbose logging.

Default: 0
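
The two intersect policies described for -p can be sketched in a few lines of Python. This is a minimal illustration, not the toolkit's implementation: it assumes half-open [start, stop) coordinates and a non-rounded midpoint, and the function name fragment_in_interval is hypothetical.

```python
def fragment_in_interval(frag_start, frag_stop, start, stop, policy="midpoint"):
    """Return True if the fragment is selected under the given policy.

    Assumes half-open coordinates [start, stop) for fragment and interval.
    """
    if policy == "midpoint":
        midpoint = (frag_start + frag_stop) / 2  # average of the end coordinates
        return start <= midpoint < stop
    if policy == "any":
        # Any overlap between the fragment and the interval.
        return frag_start < stop and frag_stop > start
    raise ValueError(f"unknown intersect policy: {policy!r}")

# Fragments straddling the right edge of the interval [1000, 2000).
inside_by_midpoint = fragment_in_interval(1900, 2060, 1000, 2000, "midpoint")
outside_by_midpoint = fragment_in_interval(1950, 2110, 1000, 2000, "midpoint")
inside_by_any = fragment_in_interval(1950, 2110, 1000, 2000, "any")
print(inside_by_midpoint, outside_by_midpoint, inside_by_any)
```

Under “midpoint” each fragment is assigned to at most one of a set of adjacent, non-overlapping intervals, whereas “any” can count a fragment in two neighbouring intervals.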

frag-length-bins#

Computes fragment lengths and aggregates them in bins by length. Either writes bins and counts to a TSV file or prints a histogram.

finaletoolkit frag-length-bins [-h] [-c CONTIG] [-S START] [-p INTERSECT_POLICY] [-E STOP] [--bin-size BIN_SIZE] [-o OUTPUT_FILE] [--contig-by-contig] [--histogram] [-q QUALITY_THRESHOLD] [-v] input_file

Positional Arguments#

input_file

BAM or SAM file containing fragment data

Named Arguments#

-c, --contig

Contig or chromosome to select fragments from. Required if using --start or --stop.

-S, --start

0-based left-most coordinate of interval to select fragments from. Must also use --contig.

-p, --intersect_policy

Specifies what policy is used to include fragments in the given interval. Default is “midpoint”. Policies include:

- midpoint: the average of end coordinates of a fragment lies in the interval.
- any: any part of the fragment is in the interval.

Default: “midpoint”

-E, --stop

1-based right-most coordinate of interval to select fragments from. Must also use --contig.

--bin-size

Used to specify a custom bin size instead of automatically calculating one.

-o, --output_file

File to write results to. “-” may be used to write to stdout. Default is “-”.

Default: “-”

--contig-by-contig

Placeholder, not implemented.

Default: False

--histogram

Draws a histogram in the terminal.

Default: False

-q, --quality_threshold

Minimum MAPQ. Default is 30.

Default: 30

-v, --verbose

Verbose logging.

Default: 0
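
To make the binning step concrete, here is a small Python sketch that groups fragment lengths into fixed-width bins and renders a text histogram. It is an illustration under assumptions: the lengths are made up, bins are keyed by their lower edge, and the way finaletoolkit chooses bin edges or draws its terminal histogram may differ.

```python
from collections import Counter

def bin_fragment_lengths(lengths, bin_size):
    """Count fragment lengths in fixed-width bins keyed by the bin's lower edge."""
    counts = Counter((length // bin_size) * bin_size for length in lengths)
    return dict(sorted(counts.items()))

def text_histogram(bins, width=40):
    """Render bin counts as rows of '#' characters scaled to `width`."""
    peak = max(bins.values())
    return [f"{lower:>5} {'#' * max(1, round(count / peak * width))} {count}"
            for lower, count in bins.items()]

lengths = [148, 152, 155, 160, 166, 167, 168, 171, 320, 333]  # made-up data
bins = bin_fragment_lengths(lengths, bin_size=10)
print(bins)
print("\n".join(text_histogram(bins)))
```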

frag-length-intervals#

Calculates fragment length statistics over user-specified genomic intervals.

finaletoolkit frag-length-intervals [-h] [-p INTERSECT_POLICY] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file interval_file

Positional Arguments#

input_file

BAM or SAM file containing paired-end (PE) whole-genome sequencing (WGS) of cfDNA

interval_file

BED file containing intervals over which to produce statistics

Named Arguments#

-p, --intersect_policy

Specifies what policy is used to include fragments in the given interval. Default is “midpoint”. Policies include:

- midpoint: the average of end coordinates of a fragment lies in the interval.
- any: any part of the fragment is in the interval.

Default: “midpoint”

-o, --output-file

File to print results to. If “-”, will print to stdout. Default is “-”.

Default: “-”

-q, --quality-threshold

Minimum MAPQ to filter for

Default: 30

-w, --workers

Number of subprocesses to use

Default: 1

-v, --verbose

Determines how much is written to stderr

Default: 0

wps#

Calculates Windowed Protection Score over a region around sites specified in a BED file from alignments in a CRAM/BAM/SAM/Frag.gz file

finaletoolkit wps [-h] [-o OUTPUT_FILE] [-i INTERVAL_SIZE] [-W WINDOW_SIZE] [-lo FRACTION_LOW] [-hi FRACTION_HIGH] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file site_bed

Positional Arguments#

input_file

BAM or SAM file containing paired-end reads of cfDNA WGS

site_bed

BED file containing sites over which to calculate WPS

Named Arguments#

-o, --output_file

BigWig file to write results to. Default is stdout

Default: “-”

-i, --interval_size

Default: 5000

-W, --window_size

Default: 120

-lo, --fraction_low

Default: 120

-hi, --fraction_high

Default: 180

-q, --quality_threshold

Default: 30

-w, --workers

Default: 1

-v, --verbose

Default: 0
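
For readers unfamiliar with the score, the sketch below computes a windowed protection score at a single position using the usual definition: the number of fragments that span the whole window minus the number of fragments with an endpoint inside it. It is a minimal illustration, not the toolkit's code. It assumes half-open (start, stop) fragment coordinates and that fraction_low and fraction_high bound the fragment lengths considered; the function name wps_at and the fragment list are hypothetical.

```python
def wps_at(position, fragments, window_size=120, fraction_low=120, fraction_high=180):
    """Windowed protection score at one position.

    fragments: iterable of (start, stop) pairs, assumed half-open.
    """
    half = window_size // 2
    win_start, win_stop = position - half, position + half
    score = 0
    for start, stop in fragments:
        if not fraction_low <= stop - start <= fraction_high:
            continue  # outside the selected fragment-length range
        if start <= win_start and stop >= win_stop:
            score += 1  # fragment spans the whole window
        elif win_start < start < win_stop or win_start < stop < win_stop:
            score -= 1  # fragment has an endpoint inside the window
    return score

# Made-up fragments around position 1000 (window 940..1060).
fragments = [(900, 1070), (930, 1100), (1000, 1150), (700, 860), (950, 1000)]
score = wps_at(1000, fragments)
print(score)
```

In this toy input two fragments span the window, one starts inside it, one lies entirely outside it, and the last is shorter than fraction_low and is ignored.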

delfi#

Calculates DELFI score over genome. NOTE: due to some ad hoc implementation details, currently the only accepted reference genome is hg19.

finaletoolkit delfi [-h] [-b BLACKLIST_FILE] [-g GAP_FILE] [-o OUTPUT_FILE] [-W WINDOW_SIZE] [-gc] [-m] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file autosomes reference_file bins_file

Positional Arguments#

input_file

SAM, BAM, CRAM, or Frag.gz file containing fragment reads.

autosomes

Tab-delimited file where column one is chromosomes and column two is the length of said chromosome.

reference_file

2bit file for reference sequence used during alignment.

bins_file

BED format file containing bins over which to calculate DELFI. To replicate the methodology of Cristiano and colleagues, use 100kb bins over human autosomes.

Named Arguments#

-b, --blacklist_file

BED file containing dark regions to ignore when calculating DELFI.

-g, --gap_file

BED4 format file with columns “chrom”, “start”, “stop”, “type”. “type” should be “centromere”, “telomere”, or “short arm”; all others are ignored. This information corresponds to the “gap” track for hg19 in the UCSC Genome Browser.

-o, --output_file

BED, bed.gz, tsv, or csv file to write results to. If “-”, writes tab-delimited data to stdout. Default is “-”.

Default: “-”

-W, --window_size

Currently unused.

Default: 5000000

-gc, --gc_correct

Indicates whether or not GC correction is applied.

Default: False

-m, --merge_bins

Indicates whether or not bins are merged into 5Mb bins.

Default: False

-q, --quality_threshold

Minimum MAPQ to filter for.

Default: 30

-w, --workers

Maximum number of subprocesses to spawn. Should be close to number of cores.

Default: 1

-v, --verbose

Default: 0
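
The bins_file can be derived from the autosomes file. The sketch below tiles each chromosome with 100kb bins and prints BED lines. It is an illustration only: the chromosome sizes are toy values, the helper name make_bins is hypothetical, and keeping the shorter final bin of each chromosome is an assumption rather than documented behaviour.

```python
def make_bins(chrom_sizes_text, bin_size=100_000):
    """Yield (chrom, start, stop) BED intervals tiling each chromosome."""
    for line in chrom_sizes_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        chrom, length = line.split("\t")[:2]
        length = int(length)
        for start in range(0, length, bin_size):
            # The last bin is truncated at the chromosome end (an assumption).
            yield chrom, start, min(start + bin_size, length)

# Toy chromosome sizes, not real genome lengths.
autosomes = "chr1\t250000\nchr2\t100000\n"
bins = list(make_bins(autosomes))
for chrom, start, stop in bins:
    print(f"{chrom}\t{start}\t{stop}")
```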

filter-bam#

Filters a BAM file so that all reads are in mapped pairs, exceed a certain MAPQ, are not flagged for quality, are read1, are not secondary or supplementary alignments, and are on the same reference sequence as the mate.

finaletoolkit filter-bam [-h] [-r REGION_FILE] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-hi FRACTION_HIGH] [-lo FRACTION_LOW] [-w WORKERS] [-v] input_file

Positional Arguments#

input_file

BAM file with PE WGS

Named Arguments#

-r, --region-file

BED file containing regions to read fragments from. Default is None.

-o, --output-file

Path to write filtered BAM. Default is “-”. If set to “-”, the BAM file will be written to stdout.

Default: “-”

-q, --quality_threshold

Minimum mapping quality to filter for. Default is 30.

Default: 30

-hi, --fraction-high

Maximum fragment size. Default is None

-lo, --fraction-low

Minimum fragment size. Default is None

-w, --workers

Number of worker processes to spawn.

Default: 1

-v, --verbose

Specify verbosity. Number of printed statements is proportional to number of vs.
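
The filter criteria listed in the command description map onto standard SAM flag bits. The following Python sketch expresses them as a predicate over plain values; it is a hedged illustration rather than the toolkit's implementation. The function name passes_filter is hypothetical, and treating the MAPQ threshold as inclusive is an assumption.

```python
# Standard SAM flag bits (from the SAM specification).
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
READ1, SECONDARY, QC_FAIL, SUPPLEMENTARY = 0x40, 0x100, 0x200, 0x800

def passes_filter(flag, mapq, ref, mate_ref, min_mapq=30):
    """Return True if a read meets the criteria described for filter-bam."""
    required = PAIRED | READ1
    excluded = UNMAPPED | MATE_UNMAPPED | SECONDARY | QC_FAIL | SUPPLEMENTARY
    return ((flag & required) == required   # paired and read1
            and (flag & excluded) == 0      # mapped pair, primary, passes QC
            and mapq >= min_mapq            # inclusive threshold (assumption)
            and ref == mate_ref)            # mate on the same reference sequence

# Flag 99: paired, proper pair, mate reverse, read1.
kept = passes_filter(99, 60, "chr1", "chr1")
# Flag 147: paired, proper pair, read reverse, read2.
dropped_read2 = passes_filter(147, 60, "chr1", "chr1")
print(kept, dropped_read2)
```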

adjust-wps#

Reads WPS data from a WIG file and applies a median filter and a Savitzky-Golay filter (Savitzky and Golay, 1964).

finaletoolkit adjust-wps [-h] [-o OUTPUT_FILE] [-m MEDIAN_WINDOW_SIZE] [-s SAVGOL_WINDOW_SIZE] [-p SAVGOL_POLY_DEG] [-w WORKERS] [--mean] [--subtract-edges] [-v] input_file interval_file genome_file

Positional Arguments#

input_file

BigWig file with WPS data.

interval_file

BED file containing intervals over which wps was calculated

genome_file

GENOME file containing chromosome/contig names and lengths. Needed to write the header for BigWig.

Named Arguments#

-o, --output-file

WIG file to print filtered WPS data. If “-”, will write to stdout. Default is “-”.

Default: “-”

-m, --median-window-size

Size of window for median filter. Default is 1000.

Default: 1000

-s, --savgol-window-size

Size of window for Savitzky-Golay filter. Default is 21.

Default: 21

-p, --savgol-poly-deg

Polynomial degree for Savitzky-Golay filter. Default is 2.

Default: 2

-w, --workers

Number of subprocesses to use. Default is 1.

Default: 1

--mean

Default: False

--subtract-edges

Default: False

-v, --verbose

Specify verbosity. Number of printed statements is proportional to number of vs.
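
The median-filter step can be pictured with the standard-library sketch below. It computes a centered running median and subtracts it from the signal. This is an assumption-laden illustration: the source does not say whether the command subtracts the median or replaces values with it, nor how window edges are handled, and the function names are hypothetical.

```python
from statistics import median

def running_median(values, window_size):
    """Median of a centered window at each position; windows shrink at the edges."""
    half = window_size // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def subtract_running_median(values, window_size):
    """Remove the local baseline by subtracting the running median (an assumption)."""
    baseline = running_median(values, window_size)
    return [v - m for v, m in zip(values, baseline)]

wps = [1, 2, 9, 4, 5]  # made-up scores
baseline = running_median(wps, window_size=3)
adjusted = subtract_running_median(wps, window_size=3)
print(baseline)
print(adjusted)
```

If SciPy is available, a Savitzky-Golay smoothing pass with the documented defaults could then be applied with scipy.signal.savgol_filter(adjusted, window_length=21, polyorder=2) on a signal at least 21 values long.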

agg-bw#

Reads data from a BigWig file and aggregates over intervals in a BED file.

finaletoolkit agg-bw [-h] [-o OUTPUT_FILE] [-m MEDIAN_WINDOW_SIZE] [-v] input_file interval_file

Positional Arguments#

input_file

BigWig file with data.

interval_file

BED file containing intervals over which wps was calculated

Named Arguments#

-o, --output-file

WIG file to print filtered WPS data. If “-”, will write to stdout. Default is “-”.

Default: “-”

-m, --median-window-size

Size of window for median filter. Default is 1000.

Default: 1000

-v, --verbose

Specify verbosity. Number of printed statements is proportional to number of vs.

delfi-gc-correct#

Performs GC correction on raw DELFI data.

finaletoolkit delfi-gc-correct [-h] [-o OUTPUT_FILE] [--header-lines HEADER_LINES] [-v] input_file

Positional Arguments#

input_file

BED3+3 file containing raw data

Named Arguments#

-o, --output-file

BED3+3 to print GC-corrected DELFI fractions. If “-”, will write to stdout. Default is “-”.

Default: “-”

--header-lines

Number of header lines in BED. Default is 1.

Default: 1

-v, --verbose

Specify verbosity. Number of printed statements is proportional to number of vs.

end-motifs#

Measures frequency of k-mer 5’ end motifs and tabulates data into a tab-delimited file.

finaletoolkit end-motifs [-h] [-k K] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file refseq_file

Positional Arguments#

input_file

SAM, BAM, or tabix-indexed file with fragment data.

refseq_file

2bit file containing reference sequence that fragments were aligned to.

Named Arguments#

-k

Length of k-mer. Default is 4.

Default: 4

-o, --output-file

TSV to print k-mer frequencies. If “-”, will write to stdout. Default is “-”.

Default: “-”

-q, --quality-threshold

Minimum MAPQ of reads. Default is 20.

Default: 20

-w, --workers

Number of subprocesses to use. Default is 1.

Default: 1

-v, --verbose

Specify verbosity. Number of printed statements is proportional to number of vs.

Default: 0
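
As a rough picture of what tabulating 5’ end motifs involves, the sketch below counts the k-mer at the 5’ end of each strand of a fragment and converts counts to frequencies. It is illustrative only. It assumes each fragment is given as its forward-strand reference sequence and that both ends are counted, with the reverse-strand motif taken as the reverse complement of the last k bases; the toolkit's exact rules are not stated here, and the function names and sequences are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def five_prime_motifs(fragment_seq, k=4):
    """Return the k-mers at the 5' end of each strand of a fragment."""
    forward = fragment_seq[:k]
    reverse = fragment_seq[-k:].translate(COMPLEMENT)[::-1]  # reverse complement
    return forward, reverse

def motif_frequencies(fragment_seqs, k=4):
    """Tabulate end-motif frequencies over all fragments at least k bases long."""
    counts = Counter()
    for seq in fragment_seqs:
        if len(seq) >= k:
            counts.update(five_prime_motifs(seq, k))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {motif: n / total for motif, n in sorted(counts.items())}

# Made-up fragment sequences.
freqs = motif_frequencies(["CCCAGGTTAA", "CCCATTTGGG"])
for motif, freq in freqs.items():
    print(f"{motif}\t{freq}")
```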

interval-end-motifs#

Measures frequency of k-mer 5’ end motifs in each region specified in a BED file and writes data into a table.

finaletoolkit interval-end-motifs [-h] [-k K] [-lo FRACTION_LOW] [-hi FRACTION_HIGH] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file refseq_file intervals

Positional Arguments#

input_file

SAM, BAM, or tabix-indexed file with fragment data.

refseq_file

2bit file containing reference sequence that fragments were aligned to.

intervals

BED file containing intervals or list of tuples

Named Arguments#

-k

Length of k-mer. Default is 4.

Default: 4

-lo, --fraction-low

Smallest fragment length to consider. Default is 10

Default: 10

-hi, --fraction-high

Longest fragment length to consider. Default is 600

Default: 600

-o, --output-file

File path to write results to. Either tsv or csv.

Default: “-”

-q, --quality-threshold

Minimum MAPQ of reads. Default is 20.

Default: 20

-w, --workers

Number of subprocesses to use. Default is 1.

Default: 1

-v, --verbose

Specify verbosity. Number of printed statements is proportional to number of vs.

Default: 0

mds#

Reads k-mer frequencies from a file and calculates a motif diversity score (MDS) using normalized Shannon entropy as described by Jiang et al (2020). This function is generalized for any k-mer instead of just 4-mers.

finaletoolkit mds [-h] [-s SEP] [--header HEADER] [file_path]

Positional Arguments#

file_path

Tab-delimited or similar file containing one column for all k-mers and one column for frequency. Reads from stdin by default.

Default: “-”

Named Arguments#

-s, --sep

Separator used in tabular file. Default is tab.

Default: “\t”

--header

Number of header rows to ignore. Default is 0

Default: 0
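
The normalized Shannon entropy behind the score can be written out in a few lines. The sketch below is a minimal illustration, assuming the entropy is normalized by the logarithm of the number of possible k-mers (4^k), which gives 1 for a uniform motif distribution and 0 when a single motif accounts for all fragments. The function name motif_diversity_score is hypothetical.

```python
import math

def motif_diversity_score(frequencies, k=4):
    """Normalized Shannon entropy of k-mer frequencies (or raw counts)."""
    total = sum(frequencies)
    probs = [f / total for f in frequencies if f > 0]  # zero terms contribute nothing
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(4 ** k)  # maximum entropy over 4^k possible k-mers

uniform = motif_diversity_score([1] * 256)   # all 256 4-mers equally frequent
skewed = motif_diversity_score([97, 1, 1, 1])
print(uniform, skewed)
```

Because the input is renormalized by its sum, the function accepts either frequencies or raw counts.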

interval-mds#

Reads k-mer frequencies from a file and calculates a motif diversity score (MDS) for each interval using normalized Shannon entropy as described by Jiang et al (2020). This function is generalized for any k-mer instead of just 4-mers.

finaletoolkit interval-mds [-h] [-s SEP] [file_path] file_out

Positional Arguments#

file_path

Tab-delimited or similar file containing one column for all k-mers and one column for frequency. Reads from stdin by default.

Default: “-”

file_out

Default: “-”

Named Arguments#

-s, --sep

Separator used in tabular file. Default is tab.

Default: “\t”

gap-bed#

Creates a BED4 file containing centromeres, telomeres, and short-arm intervals, similar to the gaps annotation track for hg19 found on the UCSC Genome Browser (Kent et al 2002). Currently only supports hg19, b37, human_g1k_v37, hg38, and GRCh38.

finaletoolkit gap-bed [-h] {hg19,b37,human_g1k_v37,hg38,GRCh38} output_file

Positional Arguments#

reference_genome

Possible choices: hg19, b37, human_g1k_v37, hg38, GRCh38

Reference genome to provide gaps for.

output_file

Path to write bed file to. If “-” used, writes to stdout.

“Gap” is used liberally in this command and, in the case of hg38/GRCh38, may refer to regions where there are no longer gaps in the reference sequence.

cleavage-profile#

Work in progress.

finaletoolkit cleavage-profile [-h] [-o OUTPUT_FILE] [-lo FRACTION_LOW] [-hi FRACTION_HIGH] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file interval_file

Positional Arguments#

input_file

BAM, CRAM, or frag.gz containing fragment coordinates.

interval_file

BED file containing intervals to calculate cleavage profile over.

Named Arguments#

-o, --output_file

Path to write output file to. If “-” is used, writes bed.gz to stdout. Writes in BigWig format if a “.bw” or “.bigwig” suffix is used, and writes a gzip-compressed BED file if a “.bed.gz” or “.bedGraph.gz” suffix is used. Default is “-”.

Default: “-”

-lo, --fraction_low

Default: 120

-hi, --fraction_high

Default: 180

-q, --quality-threshold

Minimum MAPQ of reads. Default is 20.

Default: 20

-w, --workers

Number of subprocesses to use. Default is 1.

Default: 1

-v, --verbose

Specify verbosity. Number of printed statements is proportional to number of vs.

Default: 0
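
The suffix rules described for -o, --output_file can be summarized as a small lookup. The sketch below only mirrors the documented suffixes. Treating the match as case-sensitive and raising an error for any other suffix are assumptions, and the function name output_format is hypothetical.

```python
def output_format(output_file):
    """Map an output path to the format described for cleavage-profile."""
    if output_file == "-":
        return "bed.gz to stdout"
    if output_file.endswith((".bw", ".bigwig")):
        return "BigWig"
    if output_file.endswith((".bed.gz", ".bedGraph.gz")):
        return "gzip-compressed BED"
    # Behaviour for other suffixes is not documented; raising is an assumption.
    raise ValueError(f"unrecognized output suffix: {output_file!r}")

# Hypothetical output paths.
for path in ["-", "profile.bw", "profile.bigwig", "profile.bed.gz", "profile.bedGraph.gz"]:
    print(path, "->", output_format(path))
```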