End-Motifs

class finaletoolkit.frag.EndMotifFreqs(kmer_frequencies: Iterable[tuple[str, float]], k: int, quality_threshold: int = 20)

Class that stores end-motif k-mer frequencies and contains methods to manipulate this data.

Parameters:
  • kmer_frequencies (Iterable) – An iterable of tuples, each containing a str representing a k-mer and a float representing its frequency.

  • k (int) – Size of k-mers

  • quality_threshold (int, optional) – Minimum mapping quality used. Default is 20.
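The shape of the kmer_frequencies argument can be illustrated with plain Python. The following is a minimal sketch, not part of the library: the helper name counts_to_frequencies and the toy counts are hypothetical, and it assumes that every one of the 4**k possible k-mers should appear in the output, with a frequency of 0 when unobserved.

```python
from itertools import product


def counts_to_frequencies(counts, k):
    """Turn raw k-mer counts into (kmer, frequency) tuples for all 4**k k-mers."""
    # Enumerate every possible k-mer over the DNA alphabet in a fixed order.
    all_kmers = ["".join(bases) for bases in product("ACGT", repeat=k)]
    total = sum(counts.get(kmer, 0) for kmer in all_kmers)
    if total == 0:
        raise ValueError("no counts for any valid k-mer")
    # Unobserved k-mers get a frequency of 0 (an assumption of this sketch).
    return [(kmer, counts.get(kmer, 0) / total) for kmer in all_kmers]


# Hypothetical toy counts for k = 2.
toy_counts = {"CC": 6, "CA": 3, "TG": 1}
kmer_frequencies = counts_to_frequencies(toy_counts, k=2)
```

A list built this way has the Iterable[tuple[str, float]] shape shown in the signature above.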

classmethod from_file(file_path: str | Path, quality_threshold: int, sep: str = '\t', header: int = 0) → EndMotifFreqs

Reads k-mer frequencies from a two-column delimited file (tab-delimited by default).

Parameters:
  • file_path (str or Path) – Path to the file.

  • sep (str, optional) – Delimiter used in file. Default is tab.

  • header (int, optional) – Number of lines to ignore at the head of the file. Default is 0.

Returns:

kmer_freqs

Return type:

EndMotifFreqs
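The file layout described above (two columns, a delimiter, and a number of header lines to skip) can be sketched with the standard library. This is an illustration only and does not claim to be how from_file is implemented; the function name read_kmer_table and the in-memory toy file are hypothetical.

```python
import csv
import io


def read_kmer_table(handle, sep="\t", header=0):
    """Yield (kmer, frequency) tuples from a two-column delimited text stream."""
    reader = csv.reader(handle, delimiter=sep)
    for line_number, row in enumerate(reader):
        if line_number < header or not row:
            continue  # skip header lines and blank lines
        yield row[0], float(row[1])


# Toy file with one header line, held in memory instead of on disk.
toy_file = io.StringIO("kmer\tfrequency\nAA\t0.25\nAC\t0.75\n")
rows = list(read_kmer_table(toy_file, header=1))
```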

motif_diversity_score() → float

Calculates a motif diversity score (MDS) using normalized Shannon entropy as described by Jiang et al. (2020). This function is generalized for any k instead of just 4-mers.
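Normalized Shannon entropy over k-mers can be written as MDS = -Σ p_i ln(p_i) / ln(4^k), where p_i is the frequency of the i-th k-mer. The sketch below computes that quantity in plain Python. The function name mds is hypothetical, and renormalizing the input so that it sums to 1 is an assumption of this sketch rather than documented library behavior.

```python
from math import log


def mds(kmer_frequencies, k):
    """Normalized Shannon entropy of k-mer frequencies; ranges from 0 to 1."""
    pairs = list(kmer_frequencies)
    total = sum(freq for _, freq in pairs)
    entropy = 0.0
    for _, freq in pairs:
        p = freq / total  # renormalize in case the input does not sum to 1
        if p > 0:  # a zero-frequency k-mer contributes nothing
            entropy -= p * log(p)
    # Divide by the maximum possible entropy, ln(4**k).
    return entropy / log(4 ** k)


uniform = [(a + b, 1 / 16) for a in "ACGT" for b in "ACGT"]
single = [("AA", 1.0), ("AC", 0.0)]
```

Under this definition a uniform distribution over all k-mers scores 1, and a sample in which a single motif accounts for every fragment end scores 0.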

to_tsv(output_file: str | Path, sep: str = '\t')

Writes k-mer frequencies to a TSV file.

class finaletoolkit.frag.EndMotifsIntervals(intervals: list[tuple[tuple, dict]], k: int, quality_threshold: int = 20)

Class that stores frequencies of end-motif k-mers over user-specified intervals and contains methods to manipulate this data.

Parameters:
  • intervals (list) – A list of tuples, each containing a tuple representing a genomic interval (chrom, 0-based start, 1-based stop) and a dict that maps kmers to frequencies in the interval.

  • k (int) – Size of k-mers

  • quality_threshold (int, optional) – Minimum mapping quality used. Default is 20.

freq(kmer: str) → dict[tuple[str, int, int], float]

Returns a list of intervals and the associated frequency for the given k-mer. Results are in the form (chrom, 0-based start, 1-based stop, frequency).
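To make the intervals structure and the per-k-mer lookup concrete, here is a minimal sketch in plain Python. It assumes the list-of-(interval, dict) layout described for the constructor and emits rows in the (chrom, start, stop, frequency) form described above. The function name kmer_frequency_by_interval and the toy data are hypothetical, and treating a missing k-mer as 0.0 is an assumption.

```python
def kmer_frequency_by_interval(intervals, kmer):
    """Collect (chrom, start, stop, frequency) rows for one k-mer."""
    rows = []
    for (chrom, start, stop), freqs in intervals:
        # A k-mer absent from an interval is reported as 0.0 in this sketch.
        rows.append((chrom, start, stop, freqs.get(kmer, 0.0)))
    return rows


# Toy data: two intervals, each mapped to its k-mer frequencies.
toy_intervals = [
    (("chr1", 0, 1000), {"CCCA": 0.02, "AAAA": 0.01}),
    (("chr1", 1000, 2000), {"CCCA": 0.03}),
]
ccca_rows = kmer_frequency_by_interval(toy_intervals, "CCCA")
```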

classmethod from_file(file_path: str, quality_threshold: int, sep: str = ',', header: int = 0) → EndMotifsIntervals

Reads k-mer frequencies from a delimited file. Expected columns are contig, start, stop, name, count, (kmers). Because exporting to file includes an option to convert counts to fractions, reading an exported file does not always reproduce the original data exactly.

Parameters:
  • file_path (str) – Path to the file.

  • quality_threshold (int) – MAPQ filter used. Only used for some calculations.

  • sep (str, optional) – Delimiter used in file. Default is ','.

Returns:

kmer_freqs

Return type:

EndMotifsIntervals

mds_bed(output_file: str | Path, sep: str = '\t')

Writes the MDS for each interval to a BED/bedGraph file.

motif_diversity_score() → list[tuple[tuple, float]]

Calculates a motif diversity score (MDS) for each interval using normalized Shannon entropy as described by Jiang et al. (2020). This function is generalized for any k instead of just 4-mers.

to_bed(kmer: str, output_file: str | Path, calc_freq: bool = True, sep: str = '\t')

Takes the frequency of the specified k-mer and writes it to a BED file.

Parameters:
  • output_file (str) – File to write frequencies to.

  • calc_freq (bool, optional) – Calculates frequency of motifs if true. Otherwise, writes counts for each motif. Default is true.

  • sep (str, optional) – Separator for table. Tab-separated by default.

to_bedgraph(kmer: str, output_file: str | Path, calc_freq: bool = True, sep: str = '\t')

Takes the frequency of the specified k-mer and writes it to a bedGraph file.

Parameters:
  • output_file (str) – File to write frequencies to.

  • calc_freq (bool, optional) – Calculates frequency of motifs if true. Otherwise, writes counts for each motif. Default is true.

  • sep (str, optional) – Separator for table. Tab-separated by default.
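The effect of calc_freq can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: it assumes each interval maps k-mers to raw counts, uses the standard four-column bedGraph layout (chrom, start, end, value), and uses a hypothetical function name, write_bedgraph_rows.

```python
import io


def write_bedgraph_rows(intervals, kmer, handle, calc_freq=True, sep="\t"):
    """Write one chrom/start/stop/value row per interval for a single k-mer."""
    for (chrom, start, stop), counts in intervals:
        value = counts.get(kmer, 0)
        if calc_freq:
            # Convert the raw count into a fraction of all motifs in the interval.
            total = sum(counts.values())
            value = value / total if total else 0.0
        handle.write(sep.join(str(x) for x in (chrom, start, stop, value)) + "\n")


toy = [(("chr1", 0, 100), {"CCCA": 3, "AAAA": 1})]
as_fraction = io.StringIO()
write_bedgraph_rows(toy, "CCCA", as_fraction)
as_count = io.StringIO()
write_bedgraph_rows(toy, "CCCA", as_count, calc_freq=False)
```

With calc_freq left at True the value column holds 0.75 for the toy interval; with calc_freq=False it holds the raw count 3.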

to_tsv(output_file: str | Path, calc_freq: bool = True, sep: str = '\t')

Writes all intervals and associated frequencies to file. Columns are contig, start, stop, name, count, (kmers).

Parameters:
  • output_file (str) – File to write frequencies to.

  • calc_freq (bool, optional) – Calculates frequency of motifs if true. Otherwise, writes counts for each motif. Default is true.

  • sep (str, optional) – Separator for table. Tab-separated by default.

finaletoolkit.frag.region_end_motifs(input_file: str, contig: str, start: int, stop: int, refseq_file: str | Path, k: int = 4, fraction_low: int = 10, fraction_high: int = 600, both_strands: bool = True, negative_strand: bool = False, output_file: str | None = None, quality_threshold: int = 20, verbose: bool | int = False) → dict

Function that reads fragments in the specified region from a BAM, CRAM, or tabix-indexed fragment file and returns the 5’ k-mer (default is 4-mer) end motif counts as a dictionary. This function reproduces the methodology of Zhou et al. (2023).

Parameters:
  • input_file (str) – Path of BAM, CRAM, or Frag.gz containing paired-end reads.

  • contig (str) – Name of contig or chromosome for region.

  • start (int) – 0-based start coordinate.

  • stop (int) – 1-based end coordinate.

  • refseq_file (str or Path) – 2bit file with reference sequence input_file was aligned to.

  • k (int, optional) – Length of end motif kmer. Default is 4.

  • fraction_low (int, optional) – Minimum fragment length.

  • fraction_high (int, optional) – Maximum fragment length.

  • both_strands (bool, optional) – Choose whether to use forward 5’ ends only or use 5’ ends for both ends of PE reads. If negative_strand is True, only reverse 5’ ends are used.

  • negative_strand (bool) – Only considered if the both_strands option is False. When set to True, only ends on the negative strand are considered.

  • output_file (None or str, optional) – Ignored.

  • quality_threshold (int, optional) – Minimum MAPQ to filter. Default is 20.

  • verbose (bool or int, optional)

Returns:

end_motif_freq

Return type:

dict
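The idea of a 5’ end motif can be sketched on a toy reference string. The code below is a hedged illustration, not the library's implementation: it assumes the forward-strand motif is the first k reference bases of the fragment and the reverse-strand motif is the reverse complement of the last k reference bases, with a 0-based start and an end-exclusive stop. All names and the toy sequence are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def fragment_end_motifs(reference, start, stop, k=4):
    """Return (forward, reverse) 5' end k-mers for a fragment spanning [start, stop)."""
    forward = reference[start:start + k]
    reverse = reverse_complement(reference[stop - k:stop])
    return forward, reverse


toy_reference = "ACGTTGCAAGGCTTAC"
fwd, rev = fragment_end_motifs(toy_reference, start=2, stop=14, k=4)

# Tally both ends into a k-mer -> count mapping, the shape of a dict result.
motif_counts = Counter([fwd, rev])
```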

finaletoolkit.frag.end_motifs(input_file: str, refseq_file: str | Path, k: int = 4, min_length: int = 10, max_length: int = 600, both_strands: bool = True, negative_strand: bool = False, output_file: None | str = None, quality_threshold: int = 30, workers: int = 1, verbose: bool | int = False, fraction_low: int | None = None, fraction_high: int | None = None) → EndMotifFreqs

Function that reads fragments from a BAM, CRAM, or tabix-indexed file and returns the 5’ k-mer (default is 4-mer) end motif frequencies as an EndMotifFreqs object. Optionally writes data to a TSV file. This function reproduces the methodology of Zhou et al. (2023).

Parameters:
  • input_file (str) – BAM, CRAM, or Frag.gz file with paired-end reads.

  • refseq_file (str or Path) – 2bit file with sequence of reference genome input_file is aligned to.

  • k (int, optional) – Length of end motif kmer. Default is 4.

  • min_length (int or None, optional) – Minimum length of fragments to be included.

  • max_length (int or None, optional) – Maximum length of fragments to be included.

  • both_strands (bool) – Indicate whether to calculate 5’ end motifs on both positive and negative strands or not. If False, only 5’ ends of the positive strand are considered, unless the negative_strand option is set to True. Default is True.

  • negative_strand (bool) – Only considered if the both_strands option is False. When set to True, only ends on the negative strand are considered.

  • output_file (None or str, optional) – File path to write results to. Either tsv or csv. Default is None.

  • quality_threshold (int, optional) – Minimum MAPQ to filter. Default is 30.

  • workers (int, optional) – Number of worker processes. Default is 1.

  • verbose (bool or int, optional)

  • fraction_low (int or None, optional) – Alias for min_length. Deprecated.

  • fraction_high (int or None, optional) – Alias for max_length. Deprecated.

Returns:

end_motif_freq

Return type:

EndMotifFreqs
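How min_length, max_length, both_strands, and negative_strand interact can be summarized in a small decision function. This sketch only restates the parameter descriptions above. The function name select_ends is hypothetical, and treating both length bounds as inclusive is an assumption.

```python
def select_ends(fragment_length, both_strands=True, negative_strand=False,
                min_length=10, max_length=600):
    """Return which 5' ends of a fragment would be counted."""
    # Length filter; a bound of None disables that side of the filter.
    if min_length is not None and fragment_length < min_length:
        return ()
    if max_length is not None and fragment_length > max_length:
        return ()
    if both_strands:
        return ("forward", "reverse")
    # negative_strand is only consulted when both_strands is False.
    return ("reverse",) if negative_strand else ("forward",)
```

For example, select_ends(167) keeps both ends, while select_ends(167, both_strands=False, negative_strand=True) keeps only the reverse-strand end.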

finaletoolkit.frag.interval_end_motifs(input_file: str, refseq_file: str | Path, intervals: str | Iterable[tuple[str, int, int, str]], k: int = 4, min_length: int | None = 10, max_length: int | None = 600, both_strands: bool = True, negative_strand: bool = False, output_file: str | None = None, quality_threshold: int = 30, workers: int = 1, verbose: bool | int = False, fraction_low: int | None = None, fraction_high: int | None = None) → EndMotifsIntervals

Function that reads fragments from a BAM, CRAM, or tabix-indexed file and user-specified intervals and returns the 5’ k-mer (default is 4-mer) end motif frequencies for each interval as an EndMotifsIntervals object. Optionally writes data to a TSV file.

Parameters:
  • input_file (str) – Path of BAM, CRAM, or Frag.gz containing paired-end reads.

  • refseq_file (str or Path) – Path of 2bit file for reference genome that reads are aligned to.

  • intervals (str or Iterable) – Path of BED file containing intervals, or an iterable of tuples (chrom, start, stop, name).

  • k (int, optional) – Length of end motif kmer. Default is 4.

  • both_strands (bool) – Indicate whether to calculate 5’ end motifs on both positive and negative strands or not. If False, only 5’ ends of the positive strand are considered, unless the negative_strand option is set to True. Default is True.

  • negative_strand (bool) – Only considered if the both_strands option is False. When set to True, only ends on the negative strand are considered.

  • output_file (None or str, optional) – File path to write results to. Either tsv or csv.

  • quality_threshold (int, optional) – Minimum MAPQ to filter.

  • workers (int, optional) – Number of worker processes.

  • verbose (bool or int, optional)

Returns:

end_motif_freq

Return type:

EndMotifsIntervals
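The intervals argument accepts either a BED path or an iterable of (chrom, start, stop, name) tuples. The sketch below shows one way to turn BED text into such tuples using only the standard library. The function name read_bed_intervals is hypothetical, and using "." as the name when a line has no fourth column is an assumption.

```python
import io


def read_bed_intervals(handle):
    """Parse BED lines into (chrom, start, stop, name) tuples."""
    intervals = []
    for line in handle:
        line = line.strip()
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, stop = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else "."
        intervals.append((chrom, start, stop, name))
    return intervals


toy_bed = io.StringIO("# toy intervals\nchr1\t0\t1000\tbin1\nchr1\t1000\t2000\n")
parsed_intervals = read_bed_intervals(toy_bed)
```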