End-Motifs
- class finaletoolkit.frag.EndMotifFreqs(kmer_frequencies: Iterable[tuple[str, float]], k: int, quality_threshold: int = 20)#
Class that stores end-motif k-mer frequencies and contains methods to manipulate this data.
- Parameters:
kmer_frequencies (Iterable) – An Iterable of tuples, each containing a str representing a k-mer and a float representing its frequency
k (int) – Size of k-mers
quality_threshold (int, optional) – Minimum mapping quality used. Default is 20.
- classmethod from_file(file_path: str | Path, quality_threshold: int, sep: str = '\t', header: int = 0) EndMotifFreqs #
Reads kmer frequency from a two-column tab-delimited file.
- Parameters:
file_path (str) – Path string containing path to file.
sep (str, optional) – Delimiter used in file.
header (int, optional) – Number of lines to ignore at the head of the file.
- Returns:
kmer_freqs
- Return type:
EndMotifFreqs
- motif_diversity_score() float #
Calculates a motif diversity score (MDS) using normalized Shannon entropy as described by Jiang et al (2020). This function is generalized for any k instead of just 4-mers.
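A minimal sketch of a normalized Shannon entropy calculation, assuming the frequencies are held in a plain dict and that the entropy is divided by the logarithm of the number of possible k-mers (4**k). The standalone function below is illustrative only and is not the library's implementation.

```python
import math
from itertools import product


def motif_diversity_score(freqs: dict[str, float], k: int) -> float:
    """Shannon entropy of k-mer frequencies, normalized by log(4**k)."""
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for value in freqs.values():
        p = value / total  # renormalize so the values sum to 1
        if p > 0:
            entropy -= p * math.log(p)
    return entropy / math.log(4 ** k)


# All 16 possible 2-mers, equally frequent: the most diverse case.
uniform = {"".join(m): 1 / 16 for m in product("ACGT", repeat=2)}
# A single motif accounts for every fragment end: the least diverse case.
single = {"AA": 1.0}

mds_uniform = motif_diversity_score(uniform, 2)
mds_single = motif_diversity_score(single, 2)
```

Under these assumptions the score lies between 0 (one motif only) and 1 (all motifs equally frequent).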
- to_tsv(output_file: str | Path, sep: str = '\t')#
Writes k-mer frequencies to a TSV file.
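The two-column layout read by from_file and written by to_tsv can be sketched with the standard csv module. The helpers `write_kmer_table` and `read_kmer_table` are hypothetical names used for illustration, and the exact file layout produced by the library (for example, whether a header row is present) is an assumption here.

```python
import csv
import io


def write_kmer_table(freqs, handle, sep="\t"):
    """Write (kmer, frequency) pairs as two delimited columns."""
    writer = csv.writer(handle, delimiter=sep, lineterminator="\n")
    for kmer, freq in freqs:
        writer.writerow([kmer, freq])


def read_kmer_table(handle, sep="\t", header=0):
    """Read (kmer, frequency) pairs, skipping `header` leading lines."""
    reader = csv.reader(handle, delimiter=sep)
    for _ in range(header):
        next(reader, None)
    return [(row[0], float(row[1])) for row in reader if row]


# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_kmer_table([("AAAA", 0.25), ("CCCA", 0.75)], buffer)
buffer.seek(0)
round_trip = read_kmer_table(buffer)

# A file with one header line is read with header=1.
with_header = io.StringIO("kmer\tfrequency\nACGT\t1.0\n")
skipped = read_kmer_table(with_header, header=1)
```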
- class finaletoolkit.frag.EndMotifsIntervals(intervals: Iterable[tuple[tuple, dict]], k: int, quality_threshold: int = 20)#
Class that stores frequencies of end-motif k-mers over user-specified intervals and contains methods to manipulate this data.
- Parameters:
intervals (Iterable) – A collection of tuples, each containing a tuple representing a genomic interval (chrom, 0-based start, 1-based stop) and a dict that maps kmers to frequencies in the interval.
k (int) – Size of k-mers
quality_threshold (int, optional) – Minimum mapping quality used. Default is 20.
- freq(kmer: str) list[tuple[str, int, int, float]] #
Returns a list of intervals and the associated frequency for the given kmer. Results are in the form (chrom, 0-based start, 1-based stop, frequency).
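As an illustration of the result shape, the sketch below looks up one k-mer in a collection of (interval, dict) pairs like the one passed to the constructor. The function `kmer_freq_by_interval` is a hypothetical stand-in, and it assumes the dict values can be reported as they are; how the library derives the frequency may differ.

```python
def kmer_freq_by_interval(intervals, kmer):
    """Return (chrom, start, stop, value) for one k-mer in every interval."""
    results = []
    # Each item is ((chrom, start, stop, ...), {kmer: value}); any extra
    # fields in the interval tuple (such as a name) are ignored.
    for (chrom, start, stop, *_rest), kmer_values in intervals:
        value = float(kmer_values.get(kmer, 0.0))  # absent k-mers count as 0
        results.append((chrom, start, stop, value))
    return results


intervals = [
    (("chr1", 0, 1000), {"CCCA": 0.02, "AAAA": 0.01}),
    (("chr1", 1000, 2000, "bin2"), {"AAAA": 0.03}),
]
ccca = kmer_freq_by_interval(intervals, "CCCA")
```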
- classmethod from_file(file_path: str, quality_threshold: int, sep: str = ',') EndMotifFreqs #
Reads k-mer frequencies from a delimited file (tab-delimited files need sep='\t', since the default sep is ','). Expected columns are contig, start, stop, name, count, (kmers). Because exporting to file includes an option to convert counts to fractions, reading an exported file back does not necessarily reproduce the original data.
- Parameters:
file_path (str) – Path string containing path to file.
quality_threshold (int) – MAPQ filter used. Only used for some calculations.
sep (str, optional) – Delimiter used in file.
- Returns:
kmer_freqs
- Return type:
- mds_bed(output_file: str | Path, sep: str = '\t')#
Writes MDS for each interval to a bed/bedgraph file.
- motif_diversity_score() list[tuple[tuple, float]] #
Calculates a motif diversity score (MDS) for each interval using normalized Shannon entropy as described by Jiang et al (2020). This function is generalized for any k instead of just 4-mers.
- to_bed(kmer: str, output_file: str | Path, calc_freq: bool = True, sep: str = '\t')#
Takes the frequency of the specified kmer and writes it to a BED file.
- Parameters:
output_file (str) – File to write frequencies to.
calc_freq (bool, optional) – Calculates frequency of motifs if true. Otherwise, writes counts for each motif. Default is true.
sep (str, optional) – Separator for table. Tab-separated by default.
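The effect of calc_freq can be sketched as follows, assuming each interval carries a dict of raw k-mer counts and that the frequency is the k-mer's count divided by the total count in that interval. The function `write_kmer_bed` and the four-column output layout are assumptions made for illustration, not the library's exact output.

```python
import io


def write_kmer_bed(intervals, kmer, handle, calc_freq=True, sep="\t"):
    """Write one line per interval: chrom, start, stop, value for `kmer`."""
    for (chrom, start, stop, *_rest), counts in intervals:
        value = counts.get(kmer, 0)
        if calc_freq:
            # Convert the raw count to a fraction of all motifs in the interval.
            total = sum(counts.values())
            value = value / total if total else 0.0
        handle.write(sep.join([chrom, str(start), str(stop), str(value)]) + "\n")


intervals = [(("chr1", 0, 100), {"AAAA": 3, "CCCA": 1})]

as_fraction = io.StringIO()
write_kmer_bed(intervals, "AAAA", as_fraction)

as_count = io.StringIO()
write_kmer_bed(intervals, "AAAA", as_count, calc_freq=False)
```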
- to_bedgraph(kmer: str, output_file: str | Path, calc_freq: bool = True, sep: str = '\t')#
Takes the frequency of the specified kmer and writes it to a bedgraph file.
- Parameters:
output_file (str) – File to write frequencies to.
calc_freq (bool, optional) – Calculates frequency of motifs if true. Otherwise, writes counts for each motif. Default is true.
sep (str, optional) – Separator for table. Tab-separated by default.
- to_tsv(output_file: str | Path, calc_freq: bool = True, sep: str = '\t')#
Writes all intervals and associated frequencies to file. Columns are contig, start, stop, name, count, (kmers).
- Parameters:
output_file (str) – File to write frequencies to.
calc_freq (bool, optional) – Calculates frequency of motifs if true. Otherwise, writes counts for each motif. Default is true.
sep (str, optional) – Separator for table. Tab-separated by default.
- finaletoolkit.frag.region_end_motifs(input_file: str, contig: str, start: int, stop: int, refseq_file: str | Path, k: int = 4, fraction_low: int = 10, fraction_high: int = 600, both_strands: bool = True, output_file: None | str = None, quality_threshold: int = 20, verbose: bool | int = False) dict #
Function that reads fragments in the specified region from a BAM, SAM, or tabix indexed file and returns the 5’ k-mer (default is 4-mer) end motif counts as a dictionary. This function reproduces the methodology of Zhou et al (2023).
- Parameters:
input_file (str) – Path of SAM, BAM, CRAM, or Frag.gz containing paired-end reads.
contig (str) – Name of contig or chromosome for region.
start (int) – 0-based start coordinate.
stop (int) – 1-based end coordinate.
refseq_file (str or Path) – 2bit file with reference sequence input_file was aligned to.
k (int, optional) – Length of end motif kmer. Default is 4.
fraction_low (int, optional) – Minimum fragment length.
fraction_high (int, optional) – Maximum fragment length.
both_strands (bool, optional) – Choose whether to use forward 5’ ends only or use 5’ ends for both ends of PE reads.
output_file (None or str, optional) –
quality_threshold (int, optional) –
verbose (bool or int, optional) –
- Returns:
end_motif_freq
- Return type:
dict
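A minimal sketch of counting 5' end motifs, assuming an in-memory reference string in place of a 2bit file and fragments given as (start, stop) pairs with a 0-based start and an exclusive stop. The names `reverse_complement` and `count_end_motifs` are illustrative. Taking the reverse-strand motif as the reverse complement of the last k reference bases is an assumption about the method, not a description of the library's internals.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")
VALID_BASES = set("ACGT")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_end_motifs(reference, fragments, k=4, fraction_low=10,
                     fraction_high=600, both_strands=True):
    """Count 5' k-mer end motifs of fragments within a length range."""
    counts = Counter()
    for start, stop in fragments:
        length = stop - start
        if length < max(fraction_low, k) or length > fraction_high:
            continue  # fragment outside the accepted length range
        motifs = [reference[start:start + k]]  # forward-strand 5' end
        if both_strands:
            # 5' end of the reverse strand, read from the fragment's far end
            motifs.append(reverse_complement(reference[stop - k:stop]))
        for motif in motifs:
            if len(motif) == k and set(motif) <= VALID_BASES:
                counts[motif] += 1  # skip motifs containing N or truncated
    return dict(counts)


reference = "ACGTACGTTTGGCCAAGGTT"
fragments = [(0, 12), (4, 20), (0, 5)]  # the last fragment is too short
counts = count_end_motifs(reference, fragments)
forward_only = count_end_motifs(reference, fragments, both_strands=False)
```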
- finaletoolkit.frag.end_motifs(input_file: str, refseq_file: str | Path, k: int = 4, fraction_low: int = 10, fraction_high: int = 600, both_strands: bool = False, output_file: None | str = None, quality_threshold: int = 30, workers: int = 1, verbose: bool | int = False) EndMotifFreqs #
Function that reads fragments from a BAM, SAM, or tabix indexed file and returns the 5’ k-mer (default is 4-mer) end motif frequencies as an EndMotifFreqs object. Optionally writes data to a tsv. This function reproduces the methodology of Zhou et al (2023).
- Parameters:
input_file (str) – SAM, BAM, CRAM, or Frag.gz file with paired-end reads.
refseq_file (str or Path) – 2bit file with sequence of reference genome input_file is aligned to.
k (int, optional) – Length of end motif kmer. Default is 4.
output_file (None or str, optional) – File path to write results to. Either tsv or csv.
quality_threshold (int, optional) – Minimum MAPQ to filter.
workers (int, optional) – Number of worker processes.
verbose (bool or int, optional) –
- Returns:
end_motif_freq
- Return type:
EndMotifFreqs
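One way to picture the step from per-region counts to genome-wide frequencies is to sum the count dictionaries and divide by the total. The sketch below assumes that counts are collected in separate chunks (for example, one per worker) and merged afterwards. The function `merge_counts_to_frequencies` is hypothetical and does not describe how the library distributes work.

```python
from collections import Counter


def merge_counts_to_frequencies(chunk_counts):
    """Sum per-chunk k-mer counts and convert them to frequencies."""
    merged = Counter()
    for counts in chunk_counts:
        merged.update(counts)  # adds counts key by key
    total = sum(merged.values())
    if total == 0:
        return []
    # (kmer, frequency) pairs, the same shape as kmer_frequencies above
    return [(kmer, merged[kmer] / total) for kmer in sorted(merged)]


chunks = [{"ACGT": 2, "CCAA": 1}, {"ACGT": 1}]
frequencies = merge_counts_to_frequencies(chunks)
```

The resulting list of (kmer, frequency) tuples has the shape that the kmer_frequencies parameter of EndMotifFreqs expects.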
- finaletoolkit.frag.interval_end_motifs(input_file: str, refseq_file: str | Path, intervals: str | Iterable[tuple[str, int, int, str]], k: int = 4, fraction_low: int = 10, fraction_high: int = 600, both_strands: bool = True, output_file: None | str = None, quality_threshold: int = 30, workers: int = 1, verbose: bool | int = False) EndMotifsIntervals #
Function that reads fragments from a BAM, SAM, or tabix indexed file and user-specified intervals and returns the 5’ k-mer (default is 4-mer) end motif frequencies for each interval as an EndMotifsIntervals object. Optionally writes data to a tsv.
- Parameters:
input_file (str) – Path of SAM, BAM, CRAM, or Frag.gz containing paired-end reads.
refseq_file (str or Path) – Path of 2bit file for reference genome that reads are aligned to.
intervals (str or Iterable) – Path of a BED file containing intervals, or an iterable of tuples (chrom, start, stop, name).
k (int, optional) – Length of end motif kmer. Default is 4.
output_file (None or str, optional) – File path to write results to. Either tsv or csv.
quality_threshold (int, optional) – Minimum MAPQ to filter.
workers (int, optional) – Number of worker processes.
verbose (bool or int, optional) –
- Returns:
end_motif_freq
- Return type:
EndMotifsIntervals
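The intervals argument accepts either a BED path or tuples of (chrom, start, stop, name). The sketch below shows one way to build such tuples from BED-formatted text. The helper `read_bed_intervals` is hypothetical, and the placeholder name "." for three-column lines is an assumption.

```python
import io


def read_bed_intervals(handle):
    """Parse BED lines into (chrom, start, stop, name) tuples."""
    intervals = []
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines, comments, and UCSC track/browser lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, stop = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else "."  # placeholder name
        intervals.append((chrom, start, stop, name))
    return intervals


bed_text = "# example intervals\nchr1\t0\t1000\tbin1\nchr1\t1000\t2000\n"
intervals = read_bed_intervals(io.StringIO(bed_text))
```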