CLI#
Calculates fragmentation features given a CRAM, BAM, SAM, or Frag.gz file.
usage: finaletoolkit [-h] {coverage,frag-length,frag-length-bins,frag-length-intervals,wps,delfi,filter-bam,adjust-wps,agg-bw,delfi-gc-correct,end-motifs,interval-end-motifs,mds,interval-mds,gap-bed,cleavage-profile} ...
subcommands#
- subcommand
Possible choices: coverage, frag-length, frag-length-bins, frag-length-intervals, wps, delfi, filter-bam, adjust-wps, agg-bw, delfi-gc-correct, end-motifs, interval-end-motifs, mds, interval-mds, gap-bed, cleavage-profile
Sub-commands#
coverage#
Calculates fragmentation coverage over intervals in a BED file given a SAM, BAM, CRAM, or Frag.gz file
finaletoolkit coverage [-h] [-o OUTPUT_FILE] [-s SCALE_FACTOR] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file interval_file
Positional Arguments#
- input_file
SAM, BAM, CRAM, or Frag.gz file containing fragment data
- interval_file
BED file containing intervals over which coverage is calculated
Named Arguments#
- -o, --output_file
BED file where coverage is printed
Default: “-”
- -s, --scale-factor
Amount coverage will be multiplied by
Default: 1000000.0
- -q, --quality_threshold
Default: 30
- -w, --workers
Number of worker processes to use. Default is 1.
Default: 1
- -v, --verbose
Default: 0
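A minimal sketch of driving the coverage subcommand from Python, using only the flags listed above. The helper name build_coverage_command and the file names (sample.bam, intervals.bed, coverage.bed) are hypothetical placeholders, and the command is only executed when finaletoolkit is found on PATH and the placeholder input exists.

```python
import os
import shutil
import subprocess


def build_coverage_command(input_file, interval_file, output_file="-",
                           scale_factor=1e6, quality_threshold=30, workers=1):
    """Assemble an argv list for `finaletoolkit coverage` (hypothetical helper)."""
    return [
        "finaletoolkit", "coverage",
        "-o", output_file,
        "-s", str(scale_factor),
        "-q", str(quality_threshold),
        "-w", str(workers),
        input_file, interval_file,
    ]


# Placeholder paths; substitute real files.
command = build_coverage_command("sample.bam", "intervals.bed",
                                 output_file="coverage.bed", workers=4)
print(" ".join(command))

# Only run when the tool is installed and the placeholder input exists.
if shutil.which("finaletoolkit") and os.path.exists("sample.bam"):
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.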
frag-length#
Calculates fragment lengths given a CRAM/BAM/SAM file
finaletoolkit frag-length [-h] [-c CONTIG] [-S START] [-E STOP] [-p INTERSECT_POLICY] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-v] input_file
Positional Arguments#
- input_file
bam or frag.gz file containing fragment data.
Named Arguments#
- -c, --contig
contig or chromosome to select fragments from. Required if using --start or --stop.
- -S, --start
0-based left-most coordinate of interval to select fragments from. Must also use --contig.
- -E, --stop
1-based right-most coordinate of interval to select fragments from. Must also use --contig.
- -p, --intersect_policy
Specifies what policy is used to include fragments in the given interval. Default is “midpoint”. Policies include:
  - midpoint: the average of the end coordinates of a fragment lies in the interval.
  - any: any part of the fragment is in the interval.
Default: “midpoint”
- -o, --output_file
File to write results to. “-” may be used to write to stdout. Default is “-”.
Default: “-”
- -q, --quality_threshold
Minimum MAPQ. Default is 30.
Default: 30
- -v, --verbose
Verbose logging.
Default: 0
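The two policies accepted by --intersect_policy can be sketched as a small predicate. This is an illustration only: the function name fragment_selected, the half-open coordinate convention, and the integer midpoint are assumptions of the sketch, and finaletoolkit's own boundary handling may differ.

```python
def fragment_selected(frag_start, frag_stop, start, stop, policy="midpoint"):
    """Return True if the fragment [frag_start, frag_stop) is selected for
    the interval [start, stop) under the given policy.

    Half-open coordinates and an integer midpoint are assumptions of this sketch.
    """
    if policy == "midpoint":
        midpoint = (frag_start + frag_stop) // 2
        return start <= midpoint < stop
    if policy == "any":
        return frag_start < stop and frag_stop > start
    raise ValueError(f"unknown policy: {policy!r}")


# A 160 bp fragment that overlaps the interval but whose midpoint (980) lies outside it.
print(fragment_selected(900, 1060, 1000, 2000, policy="midpoint"))  # False
print(fragment_selected(900, 1060, 1000, 2000, policy="any"))       # True
```

With “midpoint”, each fragment is assigned to at most one of several adjacent intervals, whereas “any” can count a fragment in every interval it touches.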
frag-length-bins#
Computes fragment lengths and aggregates them in bins by length. Either writes bins and counts to a TSV file or prints a histogram.
finaletoolkit frag-length-bins [-h] [-c CONTIG] [-S START] [-p INTERSECT_POLICY] [-E STOP] [--bin-size BIN_SIZE] [-o OUTPUT_FILE] [--contig-by-contig] [--histogram] [-q QUALITY_THRESHOLD] [-v] input_file
Positional Arguments#
- input_file
BAM or SAM file containing fragment data
Named Arguments#
- -c, --contig
contig or chromosome to select fragments from. Required if using --start or --stop.
- -S, --start
0-based left-most coordinate of interval to select fragments from. Must also use --contig.
- -p, --intersect_policy
Specifies what policy is used to include fragments in the given interval. Default is “midpoint”. Policies include:
  - midpoint: the average of the end coordinates of a fragment lies in the interval.
  - any: any part of the fragment is in the interval.
Default: “midpoint”
- -E, --stop
1-based right-most coordinate of interval to select fragments from. Must also use --contig.
- --bin-size
Used to specify a custom bin size instead of automatically calculating one.
- -o, --output_file
File to write results to. “-” may be used to write to stdout. Default is “-”.
Default: “-”
- --contig-by-contig
Placeholder, not implemented.
Default: False
- --histogram
Draws a histogram in the terminal.
Default: False
- -q, --quality_threshold
Minimum MAPQ. Default is 30.
Default: 30
- -v, --verbose
Verbose logging.
Default: 0
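As a rough illustration of what binning by length means when an explicit --bin-size is given, the sketch below counts fragment lengths in fixed-width bins and prints both a tab-separated table and a crude text histogram. The function name bin_fragment_lengths, the half-open bin boundaries, and the example lengths are assumptions; the automatic bin-size calculation is not reproduced here.

```python
from collections import Counter


def bin_fragment_lengths(lengths, bin_size):
    """Count fragment lengths in bins of width bin_size.

    Returns a sorted list of (bin_start, bin_stop, count) with half-open bins;
    the bin boundaries used by finaletoolkit itself may differ.
    """
    counts = Counter((length // bin_size) * bin_size for length in lengths)
    return [(lo, lo + bin_size, counts[lo]) for lo in sorted(counts)]


lengths = [143, 151, 166, 167, 168, 172, 320, 334]  # made-up example data
bins = bin_fragment_lengths(lengths, bin_size=10)

# Tab-separated table, then a crude text histogram.
for lo, hi, count in bins:
    print(f"{lo}\t{hi}\t{count}")
for lo, hi, count in bins:
    print(f"{lo:>4}-{hi:<4} {'#' * count}")
```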
frag-length-intervals#
Calculates fragment length statistics over user-specified genomic intervals.
finaletoolkit frag-length-intervals [-h] [-p INTERSECT_POLICY] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file interval_file
Positional Arguments#
- input_file
BAM or SAM file containing PE WGS of cfDNA
- interval_file
BED file containing intervals over which to produce statistics
Named Arguments#
- -p, --intersect_policy
Specifies what policy is used to include fragments in the given interval. Default is “midpoint”. Policies include:
  - midpoint: the average of the end coordinates of a fragment lies in the interval.
  - any: any part of the fragment is in the interval.
Default: “midpoint”
- -o, --output-file
File to print results to. If “-”, will print to stdout. Default is “-”.
Default: “-”
- -q, --quality-threshold
minimum MAPQ to filter for
Default: 30
- -w, --workers
Number of subprocesses to use
Default: 1
- -v, --verbose
Determines how much is written to stderr
Default: 0
wps#
Calculates Windowed Protection Score over a region around sites specified in a BED file from alignments in a CRAM/BAM/SAM/Frag.gz file
finaletoolkit wps [-h] [-o OUTPUT_FILE] [-i INTERVAL_SIZE] [-W WINDOW_SIZE] [-lo FRACTION_LOW] [-hi FRACTION_HIGH] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file site_bed
Positional Arguments#
- input_file
bam or sam file containing paired-end reads of cfDNA WGS
- site_bed
bed file containing sites over which to calculate wps
Named Arguments#
- -o, --output_file
BigWig file to write results to. Default is stdout
Default: “-”
- -i, --interval_size
Default: 5000
- -W, --window_size
Default: 120
- -lo, --fraction_low
Default: 120
- -hi, --fraction_high
Default: 180
- -q, --quality_threshold
Default: 30
- -w, --workers
Default: 1
- -v, --verbose
Default: 0
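For orientation, the windowed protection score at a position is commonly defined as the number of fragments that span the whole window centered there minus the number of fragments with an end inside the window, counting only fragments within a length range. The sketch below follows that definition with the default window size and length bounds listed above; the function name wps_at, the half-open coordinates, and the exact boundary handling are assumptions and may not match finaletoolkit's implementation.

```python
def wps_at(position, fragments, window_size=120,
           fraction_low=120, fraction_high=180):
    """Windowed protection score at one position.

    fragments: iterable of (start, stop) half-open, 0-based coordinates.
    """
    half = window_size // 2
    win_start, win_stop = position - half, position + half  # half-open window
    score = 0
    for start, stop in fragments:
        length = stop - start
        if not fraction_low <= length <= fraction_high:
            continue  # outside the accepted fragment length range
        if start <= win_start and stop >= win_stop:
            score += 1  # fragment spans the whole window
        elif win_start <= start < win_stop or win_start < stop <= win_stop:
            score -= 1  # a fragment end falls inside the window
    return score


# Made-up fragments around position 1000.
fragments = [(900, 1070), (930, 1065), (1000, 1150), (500, 660), (950, 1050)]
print(wps_at(1000, fragments))
```

Here two fragments span the window, one has an end inside it, one lies elsewhere, and one is too short to be counted.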
delfi#
Calculates DELFI score over genome. NOTE: due to some ad hoc implementation details, currently the only accepted reference genome is hg19.
finaletoolkit delfi [-h] [-b BLACKLIST_FILE] [-g GAP_FILE] [-o OUTPUT_FILE] [-W WINDOW_SIZE] [-gc] [-m] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file autosomes reference_file bins_file
Positional Arguments#
- input_file
SAM, BAM, CRAM, or Frag.gz file containing fragment reads.
- autosomes
Tab-delimited file where column one is chromosomes and column two is the length of said chromosome.
- reference_file
2bit file for reference sequence used during alignment.
- bins_file
BED format file containing bins over which to calculate delfi. To replicate the methodology of Cristiano and colleagues, use 100kb bins over human autosomes.
Named Arguments#
- -b, --blacklist_file
BED file containing dark regions to ignore when calculating DELFI.
- -g, --gap_file
BED4 format file with columns “chrom”, “start”, “stop”, “type”. “type” should be “centromere”, “telomere”, or “short arm”; all others are ignored. This information corresponds to the “gap” track for hg19 in the UCSC Genome Browser.
- -o, --output_file
BED, bed.gz, tsv, or csv file to write results to. If “-”, writes tab-delimited data to stdout. Default is “-”.
Default: “-”
- -W, --window_size
Currently unused.
Default: 5000000
- -gc, --gc_correct
Indicate whether or not GC correction is applied.
Default: False
- -m, --merge_bins
Indicate whether or not bins are merged to 5Mb bins.
Default: False
- -q, --quality_threshold
Minimum MAPQ; reads below this value are filtered out.
Default: 30
- -w, --workers
Maximum number of subprocesses to spawn. Should be close to number of cores.
Default: 1
- -v, --verbose
Default: 0
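The BED4 layout expected for the gap file can be illustrated with a short reader that keeps only the three recognized types. This is a sketch under assumptions: the function name read_gap_intervals is hypothetical, skipping comment lines is a choice made here, and the example rows are invented rather than real hg19 coordinates.

```python
import csv
import io

GAP_TYPES = {"centromere", "telomere", "short arm"}


def read_gap_intervals(handle):
    """Yield (chrom, start, stop, type) from a BED4 gap file, skipping other types."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4 or row[0].startswith("#"):
            continue  # skip malformed or comment lines
        chrom, start, stop, gap_type = row[:4]
        if gap_type in GAP_TYPES:
            yield chrom, int(start), int(stop), gap_type


# Made-up rows for illustration only; not real hg19 coordinates.
example = io.StringIO(
    "chr1\t0\t10000\ttelomere\n"
    "chr1\t5000000\t6000000\tcentromere\n"
    "chr1\t7000000\t7050000\tcontig\n"
    "chr13\t0\t16000000\tshort arm\n"
)
gaps = list(read_gap_intervals(example))
for gap in gaps:
    print(gap)
```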
filter-bam#
Filters a BAM file so that all reads are in mapped pairs, exceed a certain MAPQ, are not flagged for quality, are read1, are not secondary or supplementary alignments, and are on the same reference sequence as the mate.
finaletoolkit filter-bam [-h] [-r REGION_FILE] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-hi FRACTION_HIGH] [-lo FRACTION_LOW] [-w WORKERS] [-v] input_file
Positional Arguments#
- input_file
BAM file with PE WGS
Named Arguments#
- -r, --region-file
BED file containing regions to read fragments from. Default is None.
- -o, --output-file
Path to write filtered BAM. Default is “-”. If set to “-”, the BAM file will be written to stdout.
Default: “-”
- -q, --quality_threshold
Minimum mapping quality to filter for. Default is 30.
Default: 30
- -hi, --fraction-high
Maximum fragment size. Default is None
- -lo, --fraction-low
Minimum fragment size. Default is None
- -w, --workers
Number of worker processes to spawn.
Default: 1
- -v, --verbose
Specify verbosity. Number of printed statements is proportional to number of vs.
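The filter conditions in the filter-bam description map naturally onto SAM flag bits. The predicate below is a sketch of that mapping, not the tool's code: the name keep_read is hypothetical, treating the MAPQ threshold as inclusive is an assumption, and the optional fragment-size and region filters are left out.

```python
# SAM flag bits as defined in the SAM specification.
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
READ1, SECONDARY, QC_FAIL, SUPPLEMENTARY = 0x40, 0x100, 0x200, 0x800


def keep_read(flag, mapq, reference, mate_reference, quality_threshold=30):
    """Return True if a read passes the filters described for filter-bam.

    Treating the threshold as inclusive (mapq >= threshold) is an assumption.
    """
    if not flag & PAIRED:
        return False  # not part of a pair
    if flag & (UNMAPPED | MATE_UNMAPPED):
        return False  # read or mate is unmapped
    if flag & (SECONDARY | QC_FAIL | SUPPLEMENTARY):
        return False  # secondary, QC-failed, or supplementary alignment
    if not flag & READ1:
        return False  # keep only read1 of each pair
    if reference != mate_reference:
        return False  # mate is on a different reference sequence
    return mapq >= quality_threshold


print(keep_read(99, 60, "chr1", "chr1"))   # flag 99: paired, proper pair, mate reverse, read1
print(keep_read(147, 60, "chr1", "chr1"))  # flag 147: read2 of the same pair
```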
adjust-wps#
Reads WPS data from a WIG file and applies a median filter and a Savitzky-Golay filter (Savitzky and Golay, 1964).
finaletoolkit adjust-wps [-h] [-o OUTPUT_FILE] [-m MEDIAN_WINDOW_SIZE] [-s SAVGOL_WINDOW_SIZE] [-p SAVGOL_POLY_DEG] [-w WORKERS] [--mean] [--subtract-edges] [-v] input_file interval_file genome_file
Positional Arguments#
- input_file
BigWig file with WPS data.
- interval_file
BED file containing intervals over which wps was calculated
- genome_file
GENOME file containing chromosome/contig names and lengths. Needed to write the header for BigWig.
Named Arguments#
- -o, --output-file
WIG file to print filtered WPS data. If “-”, will write to stdout. Default is “-”.
Default: “-”
- -m, --median-window-size
Size of window for median filter. Default is 1000.
Default: 1000
- -s, --savgol-window-size
Size of window for Savitzky-Golay filter. Default is 21.
Default: 21
- -p, --savgol-poly-deg
Degree of polynomial for Savitzky-Golay filter. Default is 2.
Default: 2
- -w, --workers
Number of subprocesses to use. Default is 1.
Default: 1
- --mean
Default: False
- --subtract-edges
Default: False
- -v, --verbose
Specify verbosity. Number of printed statements is proportional to number of vs.
agg-bw#
Reads data from a BigWig file and aggregates over intervals in a BED file.
finaletoolkit agg-bw [-h] [-o OUTPUT_FILE] [-m MEDIAN_WINDOW_SIZE] [-v] input_file interval_file
Positional Arguments#
- input_file
BigWig file with data.
- interval_file
BED file containing intervals over which wps was calculated
Named Arguments#
- -o, --output-file
WIG file to print filtered WPS data. If “-”, will write to stdout. Default is “-”.
Default: “-”
- -m, --median-window-size
Size of window for median filter. Default is 1000.
Default: 1000
- -v, --verbose
Specify verbosity. Number of printed statements is proportional to number of vs.
delfi-gc-correct#
Performs gc-correction on raw delfi data.
finaletoolkit delfi-gc-correct [-h] [-o OUTPUT_FILE] [--header-lines HEADER_LINES] [-v] input_file
Positional Arguments#
- input_file
BED3+3 file containing raw data
Named Arguments#
- -o, --output-file
BED3+3 to print GC-corrected DELFI fractions. If “-”, will write to stdout. Default is “-”.
Default: “-”
- --header-lines
Number of header lines in BED. Default is 1.
Default: 1
- -v, --verbose
Specify verbosity. Number of printed statements is proportional to number of vs.
end-motifs#
Measures frequency of k-mer 5’ end motifs and tabulates data into a tab-delimited file.
finaletoolkit end-motifs [-h] [-k K] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file refseq_file
Positional Arguments#
- input_file
SAM, BAM, or tabix-indexed file with fragment data.
- refseq_file
2bit file containing reference sequence that fragments were aligned to.
Named Arguments#
- -k
Length of k-mer. Default is 4.
Default: 4
- -o, --output-file
TSV to print k-mer frequencies. If “-”, will write to stdout. Default is “-”.
Default: “-”
- -q, --quality-threshold
Minimum MAPQ of reads. Default is 20.
Default: 20
- -w, --workers
Number of subprocesses to use. Default is 1.
Default: 1
- -v, --verbose
Specify verbosity. Number of printed statements is proportional to number of vs.
Default: 0
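To make the idea of a k-mer 5’ end motif concrete, the sketch below counts motifs at both ends of each fragment against a plain sequence string. Several things here are assumptions rather than documented behavior: the function name end_motif_frequencies, the use of a Python string in place of the 2bit reference, counting both fragment ends, and skipping motifs that contain N.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def end_motif_frequencies(reference, fragments, k=4):
    """Relative frequencies of k-mer 5' end motifs.

    reference: sequence string for a single contig (stands in for the 2bit file).
    fragments: iterable of (start, stop) half-open, 0-based coordinates.
    The forward-strand 5' end is read directly; the reverse-strand 5' end is
    the reverse complement of the last k bases of the fragment.
    """
    counts = Counter()
    for start, stop in fragments:
        if stop - start < k:
            continue
        forward = reference[start:start + k].upper()
        reverse = reference[stop - k:stop].upper().translate(COMPLEMENT)[::-1]
        for motif in (forward, reverse):
            if set(motif) <= set("ACGT"):  # skip motifs containing N
                counts[motif] += 1
    total = sum(counts.values())
    return {motif: n / total for motif, n in counts.items()} if total else {}


reference = "CCCAGGTTTTACGTNNACGGGG"  # toy sequence
frequencies = end_motif_frequencies(reference, [(0, 10), (4, 22)], k=4)
for motif, frequency in sorted(frequencies.items()):
    print(f"{motif}\t{frequency}")
```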
interval-end-motifs#
Measures frequency of k-mer 5’ end motifs in each region specified in a BED file and writes data into a table.
finaletoolkit interval-end-motifs [-h] [-k K] [-lo FRACTION_LOW] [-hi FRACTION_HIGH] [-o OUTPUT_FILE] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file refseq_file intervals
Positional Arguments#
- input_file
SAM, BAM, or tabix-indexed file with fragment data.
- refseq_file
2bit file containing reference sequence that fragments were aligned to.
- intervals
BED file containing intervals or list of tuples
Named Arguments#
- -k
Length of k-mer. Default is 4.
Default: 4
- -lo, --fraction-low
Smallest fragment length to consider. Default is 10
Default: 10
- -hi, --fraction-high
Longest fragment length to consider. Default is 600
Default: 600
- -o, --output-file
File path to write results to. Either tsv or csv.
Default: “-”
- -q, --quality-threshold
Minimum MAPQ of reads. Default is 20.
Default: 20
- -w, --workers
Number of subprocesses to use. Default is 1.
Default: 1
- -v, --verbose
Specify verbosity. Number of printed statements is proportional to number of vs.
Default: 0
mds#
Reads k-mer frequencies from a file and calculates a motif diversity score (MDS) using normalized Shannon entropy as described by Jiang et al (2020). This function is generalized for any k-mer instead of just 4-mers.
finaletoolkit mds [-h] [-s SEP] [--header HEADER] [file_path]
Positional Arguments#
- file_path
Tab-delimited or similar file containing one column for all k-mers and one column for frequency. Reads from stdin by default.
Default: “-”
Named Arguments#
- -s, --sep
Separator used in tabular file. Default is tab.
Default: “\t”
- --header
Number of header rows to ignore. Default is 0
Default: 0
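Normalized Shannon entropy over k-mers can be written as MDS = -Σ p_i ln(p_i) / ln(4^k), which is 1 when all 4^k motifs are equally frequent and 0 when a single motif accounts for every fragment end. The sketch below computes it from a two-column table; the function names are hypothetical, k is inferred from the length of the first k-mer, and frequencies are renormalized to sum to 1 as a precaution.

```python
import csv
import io
import math


def motif_diversity_score(frequencies, k):
    """Normalized Shannon entropy of k-mer frequencies, in [0, 1]."""
    total = sum(frequencies)
    entropy = -sum((f / total) * math.log(f / total)
                   for f in frequencies if f > 0)
    return entropy / math.log(4 ** k)


def mds_from_table(handle, sep="\t", header=0):
    """Read (k-mer, frequency) rows, skipping `header` rows, and return the MDS."""
    rows = list(csv.reader(handle, delimiter=sep))[header:]
    rows = [row for row in rows if len(row) >= 2]
    k = len(rows[0][0])  # infer k from the first k-mer
    return motif_diversity_score([float(row[1]) for row in rows], k)


# Toy 1-mer table; uniform frequencies give the maximum score.
table = io.StringIO("A\t0.25\nC\t0.25\nG\t0.25\nT\t0.25\n")
print(mds_from_table(table))
```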
interval-mds#
Reads k-mer frequencies from a file and calculates a motif diversity score (MDS) for each interval using normalized Shannon entropy as described by Jiang et al (2020). This function is generalized for any k-mer instead of just 4-mers.
finaletoolkit interval-mds [-h] [-s SEP] [file_path] file_out
Positional Arguments#
- file_path
Tab-delimited or similar file containing one column for all k-mers and one column for frequency. Reads from stdin by default.
Default: “-”
- file_out
Default: “-”
Named Arguments#
- -s, --sep
Separator used in tabular file. Default is tab.
Default: “\t”
gap-bed#
Creates a BED4 file containing centromeres, telomeres, and short-arm intervals, similar to the gaps annotation track for hg19 found on the UCSC Genome Browser (Kent et al., 2002). Currently only supports hg19, b37, human_g1k_v37, hg38, and GRCh38.
finaletoolkit gap-bed [-h] {hg19,b37,human_g1k_v37,hg38,GRCh38} output_file
Positional Arguments#
- reference_genome
Possible choices: hg19, b37, human_g1k_v37, hg38, GRCh38
Reference genome to provide gaps for.
- output_file
Path to write BED file to. If “-” is used, writes to stdout.
“Gap” is used liberally in this command and, in the case of hg38/GRCh38, may refer to regions where the reference sequence no longer has gaps.
cleavage-profile#
Work in progress.
finaletoolkit cleavage-profile [-h] [-o OUTPUT_FILE] [-lo FRACTION_LOW] [-hi FRACTION_HIGH] [-q QUALITY_THRESHOLD] [-w WORKERS] [-v] input_file interval_file
Positional Arguments#
- input_file
BAM, CRAM, or frag.gz containing fragment coordinates.
- interval_file
BED file containing intervals to calculate cleavage profile over.
Named Arguments#
- -o, --output_file
Path to write output file to. If “-” is used, writes bed.gz to stdout. Writes in BigWig format if the “.bw” or “.bigwig” suffix is used, and writes a gzip-compressed BED file if the “.bed.gz” or “.bedGraph.gz” suffix is used. Default is “-”.
Default: “-”
- -lo, --fraction_low
Default: 120
- -hi, --fraction_high
Default: 180
- -q, --quality-threshold
Minimum MAPQ of reads. Default is 20.
Default: 20
- -w, --workers
Number of subprocesses to use. Default is 1.
Default: 1
- -v, --verbose
Specify verbosity. Number of printed statements is proportional to number of vs.
Default: 0
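The suffix rules given for -o/--output_file can be summarized as a small lookup. This is only a sketch of the documented behavior: the function name cleavage_output_format is hypothetical, and both case-insensitive matching and raising an error for unrecognized suffixes are assumptions.

```python
def cleavage_output_format(output_file):
    """Map an output path to the format described for -o/--output_file.

    Case-insensitive suffix matching and raising on unknown suffixes are
    assumptions of this sketch.
    """
    if output_file == "-":
        return "bed.gz to stdout"
    lowered = output_file.lower()
    if lowered.endswith((".bw", ".bigwig")):
        return "BigWig"
    if lowered.endswith((".bed.gz", ".bedgraph.gz")):
        return "gzip-compressed BED"
    raise ValueError(f"unrecognized output suffix: {output_file}")


for path in ("-", "profile.bw", "profile.bedGraph.gz"):
    print(path, "->", cleavage_output_format(path))
```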