Frag File Utilities

finaletoolkit.utils.filter_bam(input_file: str, region_file: str | None = None, output_file: str | None = None, max_length: int | None = None, min_length: int | None = None, quality_threshold: int = 30, workers: int = 1, verbose: bool = False)

Accepts the path to a BAM file and creates a BAM file in which all reads are read1 in a proper pair, exceed the specified quality threshold, do not intersect a region in the given blacklist file, and intersect a region in the region BED file.

Parameters:
  • input_file (str) – Path string or AlignmentFile pointing to the BAM file to be filtered.

  • region_file (str, optional) –

  • output_file (str, optional) –

  • min_length (int, optional) –

  • max_length (int, optional) –

  • quality_threshold (int, optional) –

  • workers (int, optional) –

  • verbose (bool, optional) –

Returns:

output_file

Return type:

str

finaletoolkit.utils.agg_bw(input_file: str | PathLike, interval_file: str | PathLike, output_file: str | PathLike, median_window_size: int = 120, mean: bool = False, strand_location: int = 5, verbose: bool = False)

Takes a BigWig and an interval BED and aggregates signal along the intervals with a median filter.

For aggregating WPS signals, note that the median filter used to adjust the data trims the ends of each interval by half of the filter's window size. There are two ways to account for this in aggregation:

1. Supply an interval file containing smaller intervals, e.g. if you used 5kb intervals for WPS and a median filter window of 1kb, supply a BED file with 4kb windows to this function.

2. Provide the size of the median filter window in median_window_size along with the original intervals, e.g. if 5kb intervals were used for WPS and a 1kb median filter window was used, supply the 5kb BED file and the median filter window size to this function.

Do not do both of these at once.
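The arithmetic behind the two approaches can be sketched as follows. This is an illustrative calculation only: `trimmed_interval` is a hypothetical helper, not part of finaletoolkit, and it assumes the filter removes exactly half of the window from each end of an interval.

```python
def trimmed_interval(start: int, stop: int, median_window_size: int) -> tuple[int, int]:
    """Return the sub-interval left after trimming half a window from each end."""
    half = median_window_size // 2
    if stop - start <= 2 * half:
        raise ValueError("interval is not longer than the filter window")
    return start + half, stop - half


# A 5 kb WPS interval filtered with a 1 kb median window keeps its central 4 kb.
new_start, new_stop = trimmed_interval(10_000, 15_000, 1_000)
print(new_start, new_stop, new_stop - new_start)  # 10500 14500 4000
```

Approach 1 corresponds to writing the trimmed coordinates into the BED file yourself; approach 2 leaves the BED file untouched and lets median_window_size describe the trimming.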

Parameters:
  • input_file (str) –

  • interval_file (str) –

  • output_file (str) –

  • median_window_size (int, optional) – default is 120

  • mean (bool) – use mean instead

  • strand_location (int) – which column (starting at 0) of the interval file contains the strand. Default is 5.

  • verbose (int or bool, optional) – default is False

Returns:

agg_scores

Return type:

NDArray

finaletoolkit.utils.chrom_sizes_to_list(chrom_sizes_file: str | Path) → list[tuple[str, int]]

Reads chromosome names and sizes from a CHROMSIZE file into a list.

Parameters:

chrom_sizes_file (str or Path) – Tab-delimited file with column for chrom names and column for chrom sizes.

Returns:

chrom names and sizes.

Return type:

list of string, int tuples

finaletoolkit.utils.chrom_sizes_to_dict(chrom_sizes_file: str | Path) → dict[str, int]

Reads chromosome names and sizes from a CHROMSIZE file into a dict.

Parameters:

chrom_sizes_file (str or Path) – Tab-delimited file with column for chrom names and column for chrom sizes.

Returns:

Chrom names are keys and values are int chrom sizes.

Return type:

dict
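As a rough illustration of the file format both readers expect, the sketch below parses a two-column, tab-delimited file into a list of tuples and then into a dict. `read_chrom_sizes` is a hypothetical stand-in written for this example, not the library's implementation, and the file contents are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def read_chrom_sizes(path):
    """Parse a tab-delimited chrom sizes file into (name, size) tuples."""
    entries = []
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            name, size = line.rstrip("\n").split("\t")[:2]
            entries.append((name, int(size)))
    return entries


# Write a tiny example file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    sizes_path = Path(tmp) / "example.chrom.sizes"
    sizes_path.write_text("chr1\t248956422\nchr2\t242193529\n")
    as_list = read_chrom_sizes(sizes_path)
    as_dict = dict(as_list)  # chrom names become keys, sizes become values

print(as_list)
print(as_dict["chr2"])
```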

finaletoolkit.utils.frag_generator(input_file: str | pysam.AlignmentFile | pysam.TabixFile | Path, contig: str, quality_threshold: int = 30, start: int = None, stop: int = None, fraction_low: int = 120, fraction_high: int = 180, intersect_policy: str = 'midpoint', verbose: bool = False) → Generator[tuple]

Reads from BAM, SAM, or BED file and returns tuples containing contig (chromosome), start, stop (end), mapq, and strand for each fragment. Optionally may filter for mapq, size, and intersection with a region.

Parameters:
  • input_file (str or AlignmentFile) –

  • contig (str) –

  • quality_threshold (int, optional) –

  • start (int, optional) –

  • stop (int, optional) –

  • fraction_low (int, optional) – Specifies lowest fragment length included in array. Default is 120, equivalent to long fraction.

  • fraction_high (int, optional) – Specifies highest fragment length included in array. Default is 180, equivalent to long fraction.

  • intersect_policy (str, optional) – Specifies what policy is used to include fragments in the given interval. Default is “midpoint”. Policies include:
      - midpoint: the average of end coordinates of a fragment lies in the interval.
      - any: any part of the fragment is in the interval.

  • verbose (bool, optional) –

Returns:

frag_ends – Generator that yields tuples containing the region covered by each fragment in input_file.

Return type:

Generator
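The difference between the two intersect policies is easiest to see on a fragment that straddles an interval boundary. The sketch below is a minimal illustration, not the library's code: `in_interval` is a hypothetical helper, and it assumes half-open coordinates and an integer (floored) midpoint.

```python
def in_interval(frag_start, frag_stop, start, stop, intersect_policy="midpoint"):
    """Decide whether a fragment counts as inside the half-open interval [start, stop)."""
    if intersect_policy == "midpoint":
        # Assumption: the midpoint is the floored average of the two end coordinates.
        midpoint = (frag_start + frag_stop) // 2
        return start <= midpoint < stop
    if intersect_policy == "any":
        # Any shared base between the fragment and the interval is enough.
        return frag_start < stop and frag_stop > start
    raise ValueError(f"unknown intersect_policy: {intersect_policy!r}")


# A 167 bp fragment hanging over the left edge of the interval [1000, 2000).
print(in_interval(900, 1067, 1000, 2000, "midpoint"))  # False: midpoint is 983
print(in_interval(900, 1067, 1000, 2000, "any"))       # True: part of the fragment lies inside
```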

finaletoolkit.utils.frag_array(input_file: str | AlignmentFile | TabixFile | Path, contig: str, quality_threshold: int = 30, start: int | None = None, stop: int | None = None, fraction_low: int = 120, fraction_high: int = 180, intersect_policy: str = 'midpoint', verbose: bool = False) → ndarray[Any, dtype[ScalarType]]

Reads from BAM, SAM, or BED file and returns a three column matrix with fragment start and stop positions and strand.

Parameters:
  • input_file (str or AlignmentFile) –

  • contig (str) –

  • quality_threshold (int, optional) –

  • start (int, optional) –

  • stop (int, optional) –

  • fraction_low (int, optional) – Specifies lowest fragment length included in array. Default is 120, equivalent to long fraction.

  • fraction_high (int, optional) – Specifies highest fragment length included in array. Default is 180, equivalent to long fraction.

  • intersect_policy (str, optional) – Specifies what policy is used to include fragments in the given interval. Default is “midpoint”. Policies include:
      - midpoint: the average of end coordinates of a fragment lies in the interval.
      - any: any part of the fragment is in the interval.

  • verbose (bool, optional) –

Returns:

frag_ends – ‘NDArray’ with shape (n, 3) where column 1 contains the fragment start position, column 2 contains the fragment stop position, and column 3 is 1 if the fragment is on the + strand and 0 if it is on the - strand. If no fragments exist in the specified minimum-maximum interval, the returned ‘ndarray’ will have a shape of (0, 3).

Return type:

NDArray
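To make the documented return shape concrete, here is a small sketch that builds an array with the same layout from hand-written fragment records. `fragments_to_array` is a hypothetical helper for illustration only, and the plain int64 dtype is an assumption; the array returned by frag_array may use a different dtype.

```python
import numpy as np


def fragments_to_array(fragments):
    """Stack (start, stop, is_forward) records into an (n, 3) integer array."""
    rows = [(start, stop, 1 if is_forward else 0) for start, stop, is_forward in fragments]
    # reshape keeps the second dimension at 3 even when there are no rows
    return np.array(rows, dtype=np.int64).reshape(-1, 3)


frag_ends = fragments_to_array([(100, 267, True), (150, 300, False)])
empty = fragments_to_array([])
print(frag_ends.shape, empty.shape)  # (2, 3) (0, 3)
```

Code that consumes the result can therefore index columns without first checking whether any fragments were found.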

finaletoolkit.utils.low_quality_read_pairs(read, min_mapq=30)

Return True if the sequenced read described in read is not a properly paired read with a Phred score exceeding min_mapq. Based on epifluidlab/cofragr mmary_in_intervals.py

Equivalent to -F 3852 -f 3

Parameters:
  • read (pysam.AlignedSegment) – Sequenced read to check for quality, proper pairing, and whether it is mapped.

  • min_mapq (int, optional) – Minimum Phred score for map quality of read. Defaults to 30.

Returns:

is_low_quality – True if read is low quality, unmapped, or not properly paired.

Return type:

bool
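The flag values in "-F 3852 -f 3" can be unpacked with plain bit arithmetic on SAM flags. The sketch below works on an integer flag and a MAPQ value rather than on a pysam.AlignedSegment; `is_low_quality` is a hypothetical name, and treating a MAPQ equal to min_mapq as passing is an assumption of this example.

```python
# SAM flag bits excluded by -F 3852
EXCLUDE_FLAGS = (
    4        # read unmapped
    + 8      # mate unmapped
    + 256    # secondary alignment
    + 512    # failed quality checks
    + 1024   # PCR or optical duplicate
    + 2048   # supplementary alignment
)
# SAM flag bits required by -f 3
REQUIRE_FLAGS = 1 + 2  # paired, and mapped in a proper pair


def is_low_quality(flag: int, mapq: int, min_mapq: int = 30) -> bool:
    """Return True if a read with this flag and MAPQ would be filtered out."""
    has_excluded = (flag & EXCLUDE_FLAGS) != 0
    missing_required = (flag & REQUIRE_FLAGS) != REQUIRE_FLAGS
    return has_excluded or missing_required or mapq < min_mapq


print(EXCLUDE_FLAGS)              # 3852
print(is_low_quality(99, 60))     # False: paired, proper pair, read1
print(is_low_quality(1123, 60))   # True: same read marked as a duplicate
print(is_low_quality(99, 10))     # True: MAPQ below the threshold
```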

finaletoolkit.utils.overlaps(contigs_1: ndarray[Any, dtype[ScalarType]], starts_1: ndarray[Any, dtype[ScalarType]], stops_1: ndarray[Any, dtype[ScalarType]], contigs_2: ndarray[Any, dtype[ScalarType]], starts_2: ndarray[Any, dtype[ScalarType]], stops_2: ndarray[Any, dtype[ScalarType]]) → ndarray[Any, dtype[ScalarType]]

Performs vectorized computation of overlaps. Returns an array of the same shape as contigs_1 in which each element is True if the corresponding interval in set 1 overlaps any interval in set 2.
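One way to express this kind of all-against-all comparison is with NumPy broadcasting. The following is a minimal sketch under stated assumptions, not the library's implementation: `overlaps_sketch` is a hypothetical name, and intervals are treated as half-open, so intervals that merely touch do not overlap.

```python
import numpy as np


def overlaps_sketch(contigs_1, starts_1, stops_1, contigs_2, starts_2, stops_2):
    """Return a boolean array marking set-1 intervals that overlap any set-2 interval."""
    contigs_1, contigs_2 = np.asarray(contigs_1), np.asarray(contigs_2)
    starts_1, stops_1 = np.asarray(starts_1), np.asarray(stops_1)
    starts_2, stops_2 = np.asarray(starts_2), np.asarray(stops_2)
    # Compare every set-1 interval (rows) with every set-2 interval (columns).
    same_contig = contigs_1[:, None] == contigs_2[None, :]
    intersect = (starts_1[:, None] < stops_2[None, :]) & (stops_1[:, None] > starts_2[None, :])
    # A set-1 interval is a hit if any column in its row is True.
    return np.any(same_contig & intersect, axis=1)


hits = overlaps_sketch(
    ["chr1", "chr1", "chr2"], [100, 500, 100], [200, 600, 200],
    ["chr1", "chr2"], [150, 300], [250, 400],
)
print(hits.tolist())  # [True, False, False]
```

The intermediate comparison matrix has one entry per pair of intervals, so memory grows with the product of the two set sizes; for very large inputs a sorted or tree-based approach may be preferable.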