End-Motifs
- class finaletoolkit.frag.EndMotifFreqs(kmer_frequencies: Iterable[tuple[str, float]], k: int, quality_threshold: int = 20)
Class that stores end-motif k-mer frequencies and provides methods to manipulate them.
- Parameters:
kmer_frequencies (Iterable) – An iterable of tuples, each containing a str representing a k-mer and a float representing its frequency.
k (int) – Size of k-mers
quality_threshold (int, optional) – Minimum mapping quality used. Default is 20.
- classmethod from_file(file_path: str | Path, quality_threshold: int, sep: str = '\t', header: int = 0) → EndMotifFreqs
Reads k-mer frequencies from a two-column, tab-delimited file.
- Parameters:
file_path (str or Path) – Path to the file.
sep (str, optional) – Delimiter used in file.
header (int, optional) – Number of lines to ignore at the head of the file.
- Returns:
kmer_freqs
- Return type:
EndMotifFreqs
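To make the expected file layout concrete, the sketch below parses a two-column delimited table with Python's csv module. The helper name read_kmer_table is hypothetical and is not part of finaletoolkit; the sketch assumes only that header is the number of leading lines to skip and that each remaining row holds a k-mer and its frequency.

```python
import csv
import io


def read_kmer_table(handle, sep="\t", header=0):
    """Parse (k-mer, frequency) rows from a two-column delimited file.

    Hypothetical helper for illustration; not the finaletoolkit implementation.
    """
    reader = csv.reader(handle, delimiter=sep)
    # Skip the first `header` lines, mirroring the `header` parameter.
    for _ in range(header):
        next(reader, None)
    # Ignore blank lines and convert the second column to float.
    return [(row[0], float(row[1])) for row in reader if row]


# An in-memory file stands in for a real path.
example = io.StringIO("kmer\tfrequency\nAAAA\t0.25\nCCCA\t0.75\n")
kmer_frequencies = read_kmer_table(example, header=1)
print(kmer_frequencies)
```

The resulting list of tuples has the same shape as the kmer_frequencies argument described for the class constructor.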
- motif_diversity_score() → float
Calculates a motif diversity score (MDS) using normalized Shannon entropy as described by Jiang et al. (2020). This function is generalized for any k instead of just 4-mers.
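As a minimal sketch of normalized Shannon entropy for arbitrary k, the function below divides the entropy of the k-mer frequencies by its maximum possible value, log(4^k). This is an illustration written for this page, not the library's source; renormalizing the input frequencies and ignoring zero-frequency k-mers are assumptions of the sketch.

```python
import math


def motif_diversity_score(kmer_frequencies, k):
    """Normalized Shannon entropy of k-mer frequencies (illustrative sketch)."""
    # Zero-frequency k-mers contribute nothing to the entropy.
    freqs = [f for _, f in kmer_frequencies if f > 0]
    total = sum(freqs)
    # Renormalize so the frequencies sum to 1 (assumption of this sketch).
    probs = [f / total for f in freqs]
    entropy = sum(-p * math.log(p) for p in probs)
    # Divide by the maximum entropy, log(4**k), so the score lies in [0, 1].
    return entropy / math.log(4 ** k)


# All sixteen 2-mers equally frequent: maximal diversity.
uniform = [(a + b, 1 / 16) for a in "ACGT" for b in "ACGT"]
# A single 2-mer accounts for every fragment end: minimal diversity.
single = [("AA", 1.0)]

mds_uniform = motif_diversity_score(uniform, k=2)
mds_single = motif_diversity_score(single, k=2)
print(mds_uniform, mds_single)
```

A score near 1 indicates that end motifs are spread evenly across all possible k-mers, while a score near 0 indicates that a few motifs dominate.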
- to_tsv(output_file: str | Path, sep: str = '\t')
Writes k-mer frequencies to a TSV file.
- class finaletoolkit.frag.EndMotifsIntervals(intervals: list[tuple[tuple, dict]], k: int, quality_threshold: int = 20)
Class that stores frequencies of end-motif k-mers over user-specified intervals and contains methods to manipulate this data.
- Parameters:
intervals (list) – A list of tuples, each containing a tuple representing a genomic interval (chrom, 0-based start, 1-based stop) and a dict that maps kmers to frequencies in the interval.
k (int) – Size of k-mers
quality_threshold (int, optional) – Minimum mapping quality used. Default is 20.
- freq(kmer: str) → dict[tuple[str, int, int], float]
Returns each interval and its associated frequency for the given k-mer. Results are in the form (chrom, 0-based start, 1-based stop, frequency).
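The lookup can be pictured with the intervals structure described above: a list of ((chrom, start, stop), {kmer: frequency}) pairs. The sketch below follows the return annotation and builds a dict keyed by interval. The function name is hypothetical, and treating a k-mer that is absent from an interval as 0.0 is an assumption, not documented behaviour.

```python
def kmer_frequency_by_interval(intervals, kmer):
    """Map each (chrom, start, stop) interval to the frequency of one k-mer.

    Hypothetical helper for illustration; not the finaletoolkit implementation.
    """
    # A k-mer absent from an interval's dict is reported as 0.0 (assumption).
    return {interval: freqs.get(kmer, 0.0) for interval, freqs in intervals}


# Two made-up intervals with made-up frequencies.
intervals = [
    (("chr1", 0, 1000), {"CCCA": 0.02, "AAAA": 0.01}),
    (("chr1", 1000, 2000), {"CCCA": 0.03}),
]
ccca = kmer_frequency_by_interval(intervals, "CCCA")
aaaa = kmer_frequency_by_interval(intervals, "AAAA")
print(ccca)
```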
- classmethod from_file(file_path: str, quality_threshold: int, sep: str = ',', header: int = 0) → EndMotifsIntervals
Reads k-mer frequencies from a tab-delimited file. Expected columns are contig, start, stop, name, count, (kmers). Because exporting to file includes an option to convert counts to fractions, a file that is read back in does not necessarily reproduce the original data exactly.
- Parameters:
file_path (str) – Path to the file.
quality_threshold (int) – MAPQ filter used. Only used for some calculations.
sep (str, optional) – Delimiter used in file.
- Returns:
kmer_freqs
- Return type:
EndMotifsIntervals
- mds_bed(output_file: str | Path, sep: str = '\t')
Writes MDS for each interval to a bed/bedgraph file.
- motif_diversity_score() → list[tuple[tuple, float]]
Calculates a motif diversity score (MDS) for each interval using normalized Shannon entropy as described by Jiang et al. (2020). This function is generalized for any k instead of just 4-mers.
- to_bed(kmer: str, output_file: str | Path, calc_freq: bool = True, sep: str = '\t')
Writes the frequency of the specified k-mer to a BED file.
- Parameters:
output_file (str) – File to write frequencies to.
calc_freq (bool, optional) – If True, writes motif frequencies; otherwise, writes the count for each motif. Default is True.
sep (str, optional) – Separator for table. Tab-separated by default.
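The effect of calc_freq can be sketched as a choice between a raw count and that count divided by the interval total. Both helper names below are hypothetical, and the four-column line layout (chrom, start, stop, value) is an assumption made for illustration rather than the documented output format.

```python
def kmer_value(counts, kmer, calc_freq=True):
    """Return the frequency (or the raw count) of `kmer` within one interval."""
    count = counts.get(kmer, 0)
    if not calc_freq:
        return count
    total = sum(counts.values())
    # Guard against intervals that contain no fragments.
    return count / total if total else 0.0


def bed_line(interval, value, sep="\t"):
    """Format one output line; this column layout is an assumption."""
    chrom, start, stop = interval
    return sep.join([chrom, str(start), str(stop), str(value)])


# Made-up counts for a single interval.
counts = {"CCCA": 3, "AAAA": 1}
as_frequency = kmer_value(counts, "CCCA")
as_count = kmer_value(counts, "CCCA", calc_freq=False)
line = bed_line(("chr1", 0, 1000), as_frequency)
print(line)
```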
- to_bedgraph(kmer: str, output_file: str | Path, calc_freq: bool = True, sep: str = '\t')
Writes the frequency of the specified k-mer to a bedgraph file.
- Parameters:
output_file (str) – File to write frequencies to.
calc_freq (bool, optional) – If True, writes motif frequencies; otherwise, writes the count for each motif. Default is True.
sep (str, optional) – Separator for table. Tab-separated by default.
- to_tsv(output_file: str | Path, calc_freq: bool = True, sep: str = '\t')
Writes all intervals and associated frequencies to file. Columns are contig, start, stop, name, count, (kmers).
- Parameters:
output_file (str) – File to write frequencies to.
calc_freq (bool, optional) – If True, writes motif frequencies; otherwise, writes the count for each motif. Default is True.
sep (str, optional) – Separator for table. Tab-separated by default.
- finaletoolkit.frag.region_end_motifs(input_file: str, contig: str, start: int, stop: int, refseq_file: str | Path, k: int = 4, fraction_low: int = 10, fraction_high: int = 600, both_strands: bool = True, negative_strand: bool = False, output_file: str | None = None, quality_threshold: int = 20, verbose: bool | int = False) → dict
Function that reads fragments in the specified region from a BAM, CRAM, or tabix indexed fragment file and returns the 5’ k-mer (default is 4-mer) end motif counts as a dictionary. This function reproduces the methodology of Zhou et al. (2023).
- Parameters:
input_file (str) – Path of BAM, CRAM, or Frag.gz containing paired-end reads.
contig (str) – Name of contig or chromosome for region.
start (int) – 0-based start coordinate.
stop (int) – 1-based end coordinate.
refseq_file (str or Path) – 2bit file with reference sequence input_file was aligned to.
k (int, optional) – Length of end motif kmer. Default is 4.
fraction_low (int, optional) – Minimum fragment length.
fraction_high (int, optional) – Maximum fragment length.
both_strands (bool, optional) – Choose whether to use forward 5’ ends only or use 5’ ends for both ends of PE reads. If negative_strand is True, only reverse 5’ ends are used.
negative_strand (bool) – Only considered if the both_strands option is False. When set to True, only ends on the negative strand are considered.
output_file (None or str, optional) – Ignored.
quality_threshold (int, optional) – Minimum MAPQ to filter. Default is 20.
verbose (bool or int, optional)
- Returns:
end_motif_freq
- Return type:
dict
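To show how the strand options and length bounds interact, the sketch below counts 5’ end motifs from fragment coordinates and an in-memory reference string. It is a simplified illustration, not the library's implementation: the function name, the use of a plain string in place of a 2bit file, and the skipping of motifs with ambiguous bases are all assumptions. Fragments are 0-based, half-open (start, stop) offsets into the reference.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_end_motifs(fragments, reference, k=4, min_length=10, max_length=600,
                     both_strands=True, negative_strand=False):
    """Count 5' k-mer end motifs for fragments given as (start, stop) pairs."""
    counts = Counter()
    for start, stop in fragments:
        length = stop - start
        # Apply the fragment length bounds; a fragment shorter than k has no k-mer.
        if length < min_length or length > max_length or length < k:
            continue
        # The positive-strand 5' end is read directly from the reference.
        forward = reference[start:start + k].upper()
        # The negative-strand 5' end is the reverse complement of the last k bases.
        reverse = reverse_complement(reference[stop - k:stop].upper())
        if both_strands:
            motifs = [forward, reverse]
        elif negative_strand:
            motifs = [reverse]
        else:
            motifs = [forward]
        for motif in motifs:
            # Skip motifs containing ambiguous bases such as N (assumption).
            if set(motif) <= set("ACGT"):
                counts[motif] += 1
    return dict(counts)


# A toy 20 bp reference and two made-up fragments.
reference = "ACGTTTGGCCAATTGGAACC"
fragments = [(0, 12), (4, 20)]
both = count_end_motifs(fragments, reference)
forward_only = count_end_motifs(fragments, reference, both_strands=False)
reverse_only = count_end_motifs(fragments, reference, both_strands=False,
                                negative_strand=True)
print(both)
```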
- finaletoolkit.frag.end_motifs(input_file: str, refseq_file: str | Path, k: int = 4, min_length: int = 10, max_length: int = 600, both_strands: bool = True, negative_strand: bool = False, output_file: None | str = None, quality_threshold: int = 30, workers: int = 1, verbose: bool | int = False, fraction_low: int | None = None, fraction_high: int | None = None) → EndMotifFreqs
Function that reads fragments from a BAM, CRAM, or tabix indexed file and returns the 5’ k-mer (default is 4-mer) end motif frequencies as an EndMotifFreqs object. Optionally writes data to a tsv. This function reproduces the methodology of Zhou et al. (2023).
- Parameters:
input_file (str) – BAM, CRAM, or Frag.gz file with paired-end reads.
refseq_file (str or Path) – 2bit file with sequence of reference genome input_file is aligned to.
k (int, optional) – Length of end motif kmer. Default is 4.
min_length (int or None, optional) – Minimum length of fragments to be included.
max_length (int or None, optional) – Maximum length of fragments to be included.
both_strands (bool) – Indicate whether to calculate 5’ end motifs on both positive and negative strands or not. If False, only 5’ ends of the positive strand are considered, unless the negative_strand option is set to True. Default is True.
negative_strand (bool) – Only considered if the both_strands option is False. When set to True, only ends on the negative strand are considered.
output_file (None or str, optional) – File path to write results to. Either tsv or csv. Default is None.
quality_threshold (int, optional) – Minimum MAPQ to filter. Default is 30.
workers (int, optional) – Number of worker processes. Default is 1.
verbose (bool or int, optional)
fraction_low (int or None, optional) – Alias for min_length. Deprecated.
fraction_high (int or None, optional) – Alias for max_length. Deprecated.
- Returns:
end_motif_freq
- Return type:
EndMotifFreqs
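The deprecated fraction_low and fraction_high aliases could be reconciled with min_length and max_length along the lines of the sketch below. This is a hypothetical pattern, not the library's code: the helper name is invented, and letting a supplied alias override the newer parameter is an assumption.

```python
import warnings


def resolve_length_bounds(min_length=10, max_length=600,
                          fraction_low=None, fraction_high=None):
    """Fold deprecated aliases into (min_length, max_length).

    Hypothetical helper; alias precedence here is an assumption.
    """
    if fraction_low is not None:
        warnings.warn("fraction_low is deprecated; use min_length instead",
                      DeprecationWarning, stacklevel=2)
        min_length = fraction_low
    if fraction_high is not None:
        warnings.warn("fraction_high is deprecated; use max_length instead",
                      DeprecationWarning, stacklevel=2)
        max_length = fraction_high
    return min_length, max_length


# Capture the warning so the example runs quietly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    bounds = resolve_length_bounds(fraction_low=50)

print(bounds, [str(w.message) for w in caught])
```

New code would pass min_length and max_length directly and never trigger the warning.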
- finaletoolkit.frag.interval_end_motifs(input_file: str, refseq_file: str | Path, intervals: str | Iterable[tuple[str, int, int, str]], k: int = 4, min_length: int | None = 10, max_length: int | None = 600, both_strands: bool = True, negative_strand: bool = False, output_file: str | None = None, quality_threshold: int = 30, workers: int = 1, verbose: bool | int = False, fraction_low: int | None = None, fraction_high: int | None = None) → EndMotifsIntervals
Function that reads fragments from a BAM, CRAM, or tabix indexed file and user-specified intervals and returns the 5’ k-mer (default is 4-mer) end motifs for each interval as an EndMotifsIntervals object. Optionally writes data to a tsv.
- Parameters:
input_file (str) – Path of BAM, CRAM, or Frag.gz containing paired-end reads.
refseq_file (str or Path) – Path of 2bit file for reference genome that reads are aligned to.
intervals (str or Iterable) – Path of a BED file containing intervals, or an iterable of tuples (chrom, start, stop, name).
k (int, optional) – Length of end motif kmer. Default is 4.
both_strands (bool) – Indicate whether to calculate 5’ end motifs on both positive and negative strands or not. If False, only 5’ ends of the positive strand are considered, unless the negative_strand option is set to True. Default is True.
negative_strand (bool) – Only considered if the both_strands option is False. When set to True, only ends on the negative strand are considered.
output_file (None or str, optional) – File path to write results to. Either tsv or csv.
quality_threshold (int, optional) – Minimum MAPQ to filter.
workers (int, optional) – Number of worker processes.
verbose (bool or int, optional)
- Returns:
end_motif_freq
- Return type:
EndMotifsIntervals
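The intervals argument accepts either a BED path or tuples of (chrom, start, stop, name). The sketch below shows one way such tuples could be built from BED text. The helper name is hypothetical, and both the "." placeholder for a missing name column and the skipping of header lines are assumptions rather than documented behaviour.

```python
import io


def read_bed_intervals(handle):
    """Parse BED lines into (chrom, start, stop, name) tuples.

    Hypothetical helper for illustration; not the finaletoolkit implementation.
    """
    intervals = []
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and common BED header lines (assumption).
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, stop = fields[0], int(fields[1]), int(fields[2])
        # Use the fourth column as the name when present, "." otherwise (assumption).
        name = fields[3] if len(fields) > 3 else "."
        intervals.append((chrom, start, stop, name))
    return intervals


# An in-memory file stands in for a real BED path.
bed_text = "chr1\t0\t1000\tpromoter_a\nchr2\t500\t1500\n"
intervals = read_bed_intervals(io.StringIO(bed_text))
print(intervals)
```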