Frag File Utilities
- finaletoolkit.utils.filter_file(input_file: str, whitelist_file: str | None = None, blacklist_file: str | None = None, output_file: str | None = None, min_length: int | None = None, max_length: int | None = None, intersect_policy: str = 'midpoint', quality_threshold: int = 30, workers: int = 1, verbose: bool = False, fraction_low: int | None = None, fraction_high: int | None = None)
Accepts the path to a BAM, CRAM, or BED file and creates a filtered version.
Reads/intervals are filtered by whether they exceed the specified quality threshold, whether they intersect the whitelist or blacklist regions (if provided), and their length.
For BAM/CRAM files, reads are also filtered on being read1 in a proper pair.
- Parameters:
input_file (str) – Path string to the input BAM, CRAM, or BED file.
whitelist_file (str, optional) – Path to a BED file defining regions to include.
blacklist_file (str, optional) – Path to a BED file defining regions to exclude.
output_file (str, optional) – Path to the output filtered file. If None, a temporary file is created.
min_length (int, optional) – Minimum length for reads/intervals
max_length (int, optional) – Maximum length for reads/intervals
intersect_policy (str, optional) – Specifies how to determine whether fragments are in interval for whitelisting and blacklisting functionality. ‘midpoint’ (default) calculates the central coordinate of each fragment and only selects the fragment if the midpoint is in the interval. ‘any’ includes fragments with any overlap with the interval.
quality_threshold (int, optional) – Minimum mapping quality score
workers (int, optional) – Number of worker threads for samtools.
verbose (bool, optional) – Default is False
fraction_low (int, optional) – Deprecated alias for min_length
fraction_high (int, optional) – Deprecated alias for max_length
- Returns:
output_file – Path to the filtered output file.
- Return type:
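The two intersect_policy values described for filter_file can be illustrated with a minimal sketch. The helper name in_interval is hypothetical and not part of finaletoolkit; the sketch assumes 0-based, half-open coordinates and an integer (floor) midpoint, which may differ from the library's conventions.

```python
def in_interval(frag_start, frag_stop, region_start, region_stop, policy="midpoint"):
    """Hypothetical helper: decide whether a fragment counts as inside a region.

    Assumes 0-based, half-open coordinates and a floor-divided midpoint.
    """
    if policy == "midpoint":
        # Only the central coordinate of the fragment has to fall in the region.
        midpoint = (frag_start + frag_stop) // 2
        return region_start <= midpoint < region_stop
    if policy == "any":
        # Any overlap between fragment and region is enough.
        return frag_start < region_stop and region_start < frag_stop
    raise ValueError(f"unknown intersect policy: {policy!r}")


# A fragment that straddles the left edge of the region 100-200.
midpoint_hit = in_interval(20, 120, 100, 200, policy="midpoint")
any_hit = in_interval(20, 120, 100, 200, policy="any")
```

Under these assumptions the straddling fragment is rejected by 'midpoint' (its midpoint, 70, lies outside the region) but accepted by 'any'.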
- finaletoolkit.utils.agg_bw(input_file: str | PathLike, interval_file: str | PathLike, output_file: str | PathLike, median_window_size: int = 1, mean: bool = False, verbose: bool = False)
Takes a BigWig and an interval BED and aggregates signal along the intervals with a median filter.
For aggregating WPS signals, note that the median filter trims the ends of each interval by half of the window size of the filter while adjusting data. There are two ways to account for this in aggregation:
1. Supply an interval file containing smaller intervals, e.g. if you used 5kb intervals for WPS and a median filter window of 1kb, supply a BED file with 4kb windows to this function.
2. Provide the size of the median filter window in median_window_size along with the original intervals, e.g. if 5kb intervals were used for WPS and a 1kb median filter window was used, supply the 5kb BED file and the median filter window size to this function.
Do not do both of these at once.
- Parameters:
input_file (str)
interval_file (str) – BED file containing intervals. 6th column should have strand.
output_file (str)
median_window_size (int, optional) – default is 1 (no smoothing). Set to 120 if replicating Snyder et al.
mean (bool, optional) – Use a mean filter instead of a median filter. Default is False.
verbose (int or bool, optional) – default is False
- Returns:
- Return type:
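The interval arithmetic behind the agg_bw note on median-filter trimming can be sketched as follows. The helper trimmed_interval is hypothetical, and the use of integer division for "half of the window size" is an assumption rather than confirmed library behavior.

```python
def trimmed_interval(start, stop, median_window_size):
    """Hypothetical helper: interval remaining after a median filter trims
    each end by half of the window size (integer division assumed)."""
    half = median_window_size // 2
    return start + half, stop - half


# A 5kb interval filtered with a 1kb window leaves a 4kb interval.
start, stop = trimmed_interval(10_000, 15_000, 1_000)
trimmed_length = stop - start
```

With the default median_window_size of 1, half is 0 and the interval is unchanged, which is consistent with the "no smoothing" default described above.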
- finaletoolkit.utils.chrom_sizes_to_list(chrom_sizes_file: str | PathLike) -> list[tuple[str, int]]
Reads chromosome names and sizes from a CHROMSIZE file into a list.
- Parameters:
chrom_sizes_file (str or Path) – Tab-delimited file with column for chrom names and column for chrom sizes.
- Returns:
chrom names and sizes.
- Return type:
list of string, int tuples
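As an illustration of the file layout and return shape described here, the following is a minimal stand-alone parser. It is not the finaletoolkit implementation: read_chrom_sizes is a hypothetical name, the contig names and sizes are toy values, and skipping blank lines is an assumption.

```python
import io


def read_chrom_sizes(handle):
    """Hypothetical parser: (name, size) tuples from a tab-delimited handle."""
    sizes = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines (assumed behavior)
        name, size = line.split("\t")[:2]
        sizes.append((name, int(size)))
    return sizes


# Toy two-contig file held in memory instead of on disk.
example = io.StringIO("chrA\t1000\nchrB\t250\n")
as_list = read_chrom_sizes(example)
as_dict = dict(as_list)  # same data keyed by chromosome name
```

The list form preserves file order, while the dict form gives direct lookup of a size by chromosome name.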
- finaletoolkit.utils.chrom_sizes_to_dict(chrom_sizes_file: str | PathLike) -> dict[str, int]
Reads chromosome names and sizes from a CHROMSIZE file into a dict.
- Parameters:
chrom_sizes_file (str or Path) – Tab-delimited file with column for chrom names and column for chrom sizes.
- Returns:
Chrom names are keys and values are int chrom sizes.
- Return type:
- finaletoolkit.utils.frag_generator(input_file: FragFile, contig: str | None, quality_threshold: int = 30, start: int | None = None, stop: int | None = None, min_length: int | None = None, max_length: int | None = None, intersect_policy: str = 'midpoint', verbose: bool | int = False) -> Generator[tuple]
Reads from a BAM, CRAM, or fragment file and yields tuples containing contig (chromosome), start, stop (end), mapq, and strand for each fragment. Fragments may optionally be filtered by mapq, size, and intersection with a region.
- Parameters:
input_file (str, pathlike, pysam TabixFile, or pysam AlignmentFile) – Fragment coordinates stored as a BAM, CRAM, or tabix-indexed FinaleDB fragment file. Can also be a pysam object of these files.
contig (str or None) – Chromosome to fetch fragments over. May be None for genome-wide.
quality_threshold (int, optional)
start (int, optional) – Left-most coordinate of interval to fetch from. See intersect_policy.
stop (int, optional) – Right-most coordinate of interval to fetch from. See intersect_policy.
min_length (int, optional) – Specifies the lowest fragment length included. Default is None.
max_length (int, optional) – Specifies the highest fragment length included. Default is None.
intersect_policy (str, optional) – Specifies what policy is used to include fragments in the given interval. Default is “midpoint”. Policies include “midpoint”, where the average of the end coordinates of a fragment lies in the interval, and “any”, where any part of the fragment is in the interval.
verbose (bool, optional)
- Returns:
frag_ends – Generator that yields tuples: (contig: str, read_start: int, read_stop: int, mapq: int, read_on_plus: boolean)
- Return type:
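The tuple layout yielded by frag_generator and its optional mapq and size filters can be mimicked on in-memory records. This is a sketch only: filter_frags is a hypothetical name, and treating mapq >= quality_threshold as passing and the length bounds as inclusive are assumptions that the documentation above does not settle.

```python
def filter_frags(frags, quality_threshold=30, min_length=None, max_length=None):
    """Hypothetical generator over (contig, start, stop, mapq, on_plus) tuples.

    Assumes mapq >= quality_threshold passes and that length bounds are
    inclusive; a bound of None is not applied.
    """
    for contig, start, stop, mapq, on_plus in frags:
        length = stop - start
        if mapq < quality_threshold:
            continue
        if min_length is not None and length < min_length:
            continue
        if max_length is not None and length > max_length:
            continue
        yield contig, start, stop, mapq, on_plus


records = [
    ("chr1", 100, 267, 60, True),   # 167 bp, high mapq
    ("chr1", 300, 380, 60, False),  # 80 bp, too short
    ("chr1", 500, 660, 12, True),   # 160 bp, low mapq
]
kept = list(filter_frags(records, min_length=120, max_length=220))
```

Only the first record survives both filters in this example.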
- finaletoolkit.utils.frag_array(input_file: str | PathLike | AlignmentFile | TabixFile, contig: str, quality_threshold: int = 30, start: int | None = None, stop: int | None = None, min_length: int | None = None, max_length: int | None = None, intersect_policy: str = 'midpoint', verbose: bool = False) -> ndarray[Any, dtype[_ScalarType_co]]
Reads from a BAM, CRAM, or fragment file and returns a three-column matrix with fragment start and stop positions and strand.
- Parameters:
input_file (str or AlignmentFile)
contig (str)
quality_threshold (int, optional)
start (int, optional)
stop (int, optional)
min_length (int, optional) – Specifies the lowest fragment length included in the array. Default is None.
max_length (int, optional) – Specifies the highest fragment length included in the array. Default is None.
intersect_policy (str, optional) – Specifies what policy is used to include fragments in the given interval. Default is “midpoint”. Policies include “midpoint”, where the average of the end coordinates of a fragment lies in the interval, and “any”, where any part of the fragment is in the interval.
verbose (bool, optional)
- Returns:
frag_ends – ndarray with shape (n, 3), where column 1 contains the fragment start position, column 2 contains the fragment stop position, and column 3 is 1 if the fragment is on the + strand and 0 if it is on the - strand. If no fragments exist in the specified minimum-maximum interval, the returned ndarray will have a shape of (0, 3).
- Return type:
- finaletoolkit.utils.low_quality_read_pairs(read, min_mapq=30)
Return True if the sequenced read described in read is not a properly paired read with a Phred-scaled mapping quality exceeding min_mapq. Based on epifluidlab/cofragr.
Equivalent to -F 3852 -f 3
- Parameters:
read (pysam.AlignedSegment) – Sequenced read to check for quality, proper pairing, and whether it is mapped.
min_mapq (int, optional) – Minimum Phred score for map quality of read. Defaults to 30.
- Returns:
is_low_quality – True if the read is low quality, unmapped, or not properly paired.
- Return type:
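The samtools flags quoted for low_quality_read_pairs, -F 3852 -f 3, decode into standard SAM FLAG bits. The sketch below shows that decoding on plain integer flags. The helper passes_flag_filter is hypothetical and covers only the flag test, not the min_mapq comparison or how the library function combines the two.

```python
# SAM FLAG bits excluded by -F 3852.
EXCLUDE_FLAGS = (
    0x4      # read unmapped
    | 0x8    # mate unmapped
    | 0x100  # secondary alignment
    | 0x200  # failed quality checks
    | 0x400  # PCR or optical duplicate
    | 0x800  # supplementary alignment
)
# SAM FLAG bits required by -f 3.
REQUIRE_FLAGS = 0x1 | 0x2  # paired, and mapped in a proper pair


def passes_flag_filter(flag):
    """Hypothetical helper: True if a SAM flag satisfies -F 3852 -f 3."""
    has_required = (flag & REQUIRE_FLAGS) == REQUIRE_FLAGS
    has_excluded = (flag & EXCLUDE_FLAGS) != 0
    return has_required and not has_excluded


proper_read1 = passes_flag_filter(99)          # paired, proper pair, mate reverse, read1
duplicate = passes_flag_filter(99 | 0x400)     # same read marked as duplicate
mate_unmapped = passes_flag_filter(73)         # paired, mate unmapped, read1
```

A read for which passes_flag_filter is False would be one that the -F 3852 -f 3 filter removes.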
- finaletoolkit.utils.overlaps(contigs_1: ndarray[Any, dtype[_ScalarType_co]], starts_1: ndarray[Any, dtype[_ScalarType_co]], stops_1: ndarray[Any, dtype[_ScalarType_co]], contigs_2: ndarray[Any, dtype[_ScalarType_co]], starts_2: ndarray[Any, dtype[_ScalarType_co]], stops_2: ndarray[Any, dtype[_ScalarType_co]]) -> ndarray[Any, dtype[_ScalarType_co]]
Performs a vectorized computation of overlaps. Returns an array of the same shape as contigs_1 whose elements are True where the corresponding interval in set 1 has any overlap with an interval in set 2.
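The result described for overlaps can be stated as a plain-Python reference. This is a non-vectorized sketch of the semantics, not the library's implementation: any_overlap is a hypothetical name, it works on lists rather than ndarrays, and it assumes half-open intervals so that intervals which merely touch do not overlap.

```python
def any_overlap(contigs_1, starts_1, stops_1, contigs_2, starts_2, stops_2):
    """Hypothetical reference: for each interval in set 1, report whether it
    overlaps at least one interval in set 2 on the same contig."""
    set_2 = list(zip(contigs_2, starts_2, stops_2))
    return [
        any(c1 == c2 and s1 < e2 and s2 < e1 for c2, s2, e2 in set_2)
        for c1, s1, e1 in zip(contigs_1, starts_1, stops_1)
    ]


result = any_overlap(
    ["chr1", "chr1", "chr2"], [0, 50, 0], [10, 60, 10],   # set 1
    ["chr1", "chr2"], [5, 100], [20, 200],                # set 2
)
```

Only the first interval in set 1 shares a contig and coordinates with an interval in set 2, so only the first element of result is True.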