End-Motifs#

class finaletoolkit.frag.EndMotifFreqs(kmer_frequencies: Iterable[tuple[str, float]], k: int, quality_threshold: int = 20)#

Class that stores end-motif k-mer frequencies and provides methods to manipulate this data.

Parameters#

kmer_frequencies : Iterable

An Iterable of tuples, each containing a str representing a k-mer and a float representing its frequency.

k : int

Size of k-mers.

quality_threshold : int, optional

Minimum mapping quality used. Default is 20.

classmethod from_file(file_path: str | Path, quality_threshold: int, sep: str = '\t', header: int = 0) EndMotifFreqs#

Reads k-mer frequencies from a two-column delimited file (tab-delimited by default).

Parameters#

file_path : str

Path string containing path to file.

sep : str, optional

Delimiter used in file.

header : int, optional

Number of lines to ignore at the head of the file.

Return#

kmer_freqs : EndMotifFreqs
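As a rough illustration of the file layout described above, the sketch below parses a two-column delimited file into the (k-mer, frequency) tuples that the constructor expects. It is standalone Python and not the library's implementation; the helper name `read_kmer_frequencies` and the sample data are hypothetical.

```python
import csv
import io
from typing import TextIO


def read_kmer_frequencies(
    handle: TextIO, sep: str = "\t", header: int = 0
) -> list[tuple[str, float]]:
    """Parse (k-mer, frequency) rows, ignoring the first `header` lines."""
    pairs = []
    for line_number, row in enumerate(csv.reader(handle, delimiter=sep)):
        if line_number < header or not row:
            continue  # skip header lines and blank lines
        pairs.append((row[0], float(row[1])))
    return pairs


# Hypothetical file contents with one header line.
example = io.StringIO("kmer\tfrequency\nAAAA\t0.25\nCCCA\t0.75\n")
pairs = read_kmer_frequencies(example, header=1)
```

The resulting list has the shape documented for kmer_frequencies, so it could in principle be passed on together with k and a quality threshold.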

motif_diversity_score() float#

Calculates a motif diversity score (MDS) using normalized Shannon entropy as described by Jiang et al (2020). This function is generalized for any k instead of just 4-mers.
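A minimal sketch of a normalized Shannon entropy over k-mer frequencies follows, assuming the entropy is divided by its maximum possible value, log(4^k), so that the score lies between 0 and 1. This is a standalone illustration, not the library's code; the function below is a hypothetical free function that mirrors the method's name.

```python
import math
from typing import Iterable


def motif_diversity_score(
    kmer_frequencies: Iterable[tuple[str, float]], k: int
) -> float:
    """Normalized Shannon entropy: 0 for a single motif, 1 for a uniform spread."""
    freqs = [freq for _, freq in kmer_frequencies]
    total = sum(freqs)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    # Zero-frequency motifs are skipped because p * log(p) tends to 0 as p -> 0.
    entropy = -sum((f / total) * math.log(f / total) for f in freqs if f > 0)
    # With 4**k possible motifs, the maximum entropy is log(4**k).
    return entropy / math.log(4 ** k)


# All sixteen 2-mers at equal frequency give the maximum score.
uniform = [(a + b, 1 / 16) for a in "ACGT" for b in "ACGT"]
score = motif_diversity_score(uniform, k=2)
```

Dividing by the total first means the sketch also accepts raw counts, which is an assumption of this example rather than documented behavior.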

to_tsv(output_file: str | Path, sep: str = '\t')#

Writes k-mer frequencies to a TSV file.

class finaletoolkit.frag.EndMotifsIntervals(intervals: Iterable[tuple[tuple, dict]], k: int, quality_threshold: int = 20)#

Class that stores frequencies of end-motif k-mers over user-specified intervals and contains methods to manipulate this data.

Parameters#

intervals : Iterable

A collection of tuples, each containing a tuple representing a genomic interval (chrom, 0-based start, 1-based stop) and a dict that maps k-mers to frequencies in the interval.

k : int

Size of k-mers.

quality_threshold : int, optional

Minimum mapping quality used. Default is 20.

freq(kmer: str) list[Tuple[str, int, int, float]]#

Returns a list of intervals and the associated frequency for the given k-mer. Results are in the form (chrom, 0-based start, 1-based stop, frequency).
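The lookup can be pictured with the standalone sketch below, which works on the documented intervals structure: an iterable of (interval tuple, dict) pairs. The helper name `kmer_freq_by_interval` is hypothetical, and returning 0.0 for a k-mer missing from an interval is an assumption of this example.

```python
from typing import Iterable


def kmer_freq_by_interval(
    intervals: Iterable[tuple[tuple, dict]], kmer: str
) -> list[tuple[str, int, int, float]]:
    """Collect (chrom, start, stop, frequency) for one k-mer across intervals."""
    results = []
    # Extra interval fields (such as a name) are ignored by the starred target.
    for (chrom, start, stop, *_), freqs in intervals:
        # Assumption: a k-mer absent from an interval's dict has frequency 0.
        results.append((chrom, start, stop, freqs.get(kmer, 0.0)))
    return results


# Hypothetical data: two intervals with per-interval k-mer frequencies.
intervals = [
    (("chr1", 0, 1000), {"CCCA": 0.02, "AAAA": 0.01}),
    (("chr1", 1000, 2000), {"CCCA": 0.03}),
]
ccca = kmer_freq_by_interval(intervals, "CCCA")
```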

classmethod from_file(file_path: str, quality_threshold: int, sep: str = ',') EndMotifFreqs#

Reads k-mer frequencies from a delimited file (see sep). Expected columns are contig, start, stop, name, count, *kmers. Because exporting to a file includes an option to convert counts to fractions, a file that is read back in may not exactly reproduce the object it was written from.

Parameters#

file_path : str

Path string containing path to file.

quality_threshold : int

MAPQ filter used. Only used for some calculations.

sep : str, optional

Delimiter used in file.

Return#

kmer_freqs : EndMotifFreqs

mds_bed(output_file: str | Path, sep: str = '\t')#

Writes MDS for each interval to a bed/bedgraph file.

motif_diversity_score() list[tuple[tuple, float]]#

Calculates a motif diversity score (MDS) for each interval using normalized Shannon entropy as described by Jiang et al (2020). This function is generalized for any k instead of just 4-mers.

to_bed(kmer: str, output_file: str | Path, calc_freq: bool = True, sep: str = '\t')#

Writes the frequency of the specified k-mer to a BED file.

Parameters#

output_file: str

File to write frequencies to.

calc_freq: bool, optional

Calculates frequency of motifs if true. Otherwise, writes counts for each motif. Default is true.

sep: str, optional

Separator for table. Tab-separated by default.

to_bedgraph(kmer: str, output_file: str | Path, calc_freq: bool = True, sep: str = '\t')#

Writes the frequency of the specified k-mer to a bedgraph file.

Parameters#

output_file: str

File to write frequencies to.

calc_freq: bool, optional

Calculates frequency of motifs if true. Otherwise, writes counts for each motif. Default is true.

sep: str, optional

Separator for table. Tab-separated by default.
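To make the effect of calc_freq concrete, here is a standalone sketch that writes one line per interval with four columns (chrom, start, stop, value). It assumes each interval's dict holds raw counts and that the frequency is the count divided by the interval's total; the helper name `write_kmer_bedgraph` is hypothetical and the library may format its output differently.

```python
import io
from typing import Iterable, TextIO


def write_kmer_bedgraph(
    intervals: Iterable[tuple[tuple, dict]],
    kmer: str,
    handle: TextIO,
    calc_freq: bool = True,
    sep: str = "\t",
) -> None:
    """Write chrom, start, stop, and a k-mer count or frequency per interval."""
    for (chrom, start, stop, *_), counts in intervals:
        value = counts.get(kmer, 0)
        if calc_freq:
            total = sum(counts.values())
            # Assumption: an interval with no counted motifs gets a value of 0.
            value = value / total if total else 0.0
        handle.write(sep.join([chrom, str(start), str(stop), str(value)]) + "\n")


# Hypothetical counts for a single interval.
data = [(("chr1", 0, 100), {"CCCA": 3, "AAAA": 1})]
as_freq, as_count = io.StringIO(), io.StringIO()
write_kmer_bedgraph(data, "CCCA", as_freq)
write_kmer_bedgraph(data, "CCCA", as_count, calc_freq=False)
```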

to_tsv(output_file: str | Path, calc_freq: bool = True, sep: str = '\t')#

Writes all intervals and associated frequencies to file. Columns are contig, start, stop, name, count, *kmers.

Parameters#

output_file: str

File to write frequencies to.

calc_freq: bool, optional

Calculates frequency of motifs if true. Otherwise, writes counts for each motif. Default is true.

sep: str, optional

Separator for table. Tab-separated by default.

finaletoolkit.frag.region_end_motifs(input_file: str, contig: str, start: int, stop: int, refseq_file: str | Path, k: int = 4, fraction_low: int = 10, fraction_high: int = 600, both_strands: bool = True, output_file: None | str = None, quality_threshold: int = 20, verbose: bool | int = False) dict#

Function that reads fragments in the specified region from a BAM, SAM, or tabix indexed file and returns the 5’ k-mer (default is 4-mer) end motif counts as a dictionary. This function reproduces the methodology of Zhou et al (2023).

Parameters#

input_file : str

Path of SAM, BAM, CRAM, or Frag.gz containing paired-end reads.

contig : str

Name of contig or chromosome for region.

start : int

0-based start coordinate.

stop : int

1-based end coordinate.

refseq_file : str or Path

2bit file with reference sequence input_file was aligned to.

k : int, optional

Length of end motif kmer. Default is 4.

fraction_low: int, optional

Minimum fragment length.

fraction_high: int, optional

Maximum fragment length.

both_strands: bool, optional

Choose whether to use forward 5’ ends only or use 5’ ends for both ends of PE reads.

output_file : None or str, optional

quality_threshold : int, optional

verbose : bool or int, optional

Return#

end_motif_freq : dict
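The idea of counting 5’ end motifs can be sketched in standalone Python, assuming the reference is a plain string and each fragment is a (0-based start, 1-based stop) pair. The forward-strand motif is the first k bases of the fragment, and the reverse-strand motif is the reverse complement of the last k bases. The helper names, the inclusive length bounds, and the skipping of motifs containing non-ACGT bases are assumptions of this example, not the library's implementation.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_end_motifs(reference, fragments, k=4, fraction_low=10,
                     fraction_high=600, both_strands=True):
    """Count 5' end k-mers for fragments whose length is within bounds."""
    counts = Counter()
    for start, stop in fragments:
        if not fraction_low <= stop - start <= fraction_high:
            continue  # fragment length outside the accepted range
        motifs = [reference[start:start + k]]  # forward-strand 5' end
        if both_strands:
            # The reverse-strand 5' end is read off the opposite strand.
            motifs.append(reverse_complement(reference[stop - k:stop]))
        for motif in motifs:
            # Assumption: skip truncated motifs and motifs with N or other bases.
            if len(motif) == k and set(motif) <= set("ACGT"):
                counts[motif] += 1
    return dict(counts)


# Hypothetical 16 bp reference and one fragment spanning all of it.
reference = "ACGTACGTACGTTTGG"
both = count_end_motifs(reference, [(0, 16)])
forward_only = count_end_motifs(reference, [(0, 16)], both_strands=False)
```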

finaletoolkit.frag.end_motifs(input_file: str, refseq_file: str | Path, k: int = 4, fraction_low: int = 10, fraction_high: int = 600, both_strands: bool = False, output_file: None | str = None, quality_threshold: int = 30, workers: int = 1, verbose: bool | int = False) EndMotifFreqs#

Function that reads fragments from a BAM, SAM, or tabix indexed file and returns the 5’ k-mer (default is 4-mer) end motif frequencies as an EndMotifFreqs object. Optionally writes data to a tsv. This function reproduces the methodology of Zhou et al (2023).

Parameters#

input_file : str

SAM, BAM, CRAM, or Frag.gz file with paired-end reads.

refseq_file : str or Path

2bit file with sequence of reference genome input_file is aligned to.

k : int, optional

Length of end motif kmer. Default is 4.

output_file : None or str, optional

File path to write results to. Either tsv or csv.

quality_threshold : int, optional

Minimum MAPQ to filter.

workers : int, optional

Number of worker processes.

verbose : bool or int, optional

Return#

end_motif_freq : EndMotifFreqs

finaletoolkit.frag.interval_end_motifs(input_file: str, refseq_file: str | Path, intervals: str | Iterable[Tuple[str, int, int, str]], k: int = 4, fraction_low: int = 10, fraction_high: int = 600, both_strands: bool = True, output_file: None | str = None, quality_threshold: int = 30, workers: int = 1, verbose: bool | int = False) EndMotifsIntervals#

Function that reads fragments from a BAM, SAM, or tabix indexed file and user-specified intervals and returns the 5’ k-mer (default is 4-mer) end motif frequencies for each interval as an EndMotifsIntervals object. Optionally writes data to a tsv.

Parameters#

input_file : str

Path of SAM, BAM, CRAM, or Frag.gz containing paired-end reads.

refseq_file : str or Path

Path of 2bit file for reference genome that reads are aligned to.

intervals : str or tuple

Path of BED file containing intervals or list of tuples (chrom, start, stop, name).

k : int, optional

Length of end motif kmer. Default is 4.

output_file : None or str, optional

File path to write results to. Either tsv or csv.

quality_threshold : int, optional

Minimum MAPQ to filter.

workers : int, optional

Number of worker processes.

verbose : bool or int, optional

Return#

end_motif_freq : EndMotifsIntervals
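Since intervals may be given either as a BED path or as a list of (chrom, start, stop, name) tuples, the standalone sketch below shows one way the two forms relate by parsing BED text into such tuples. The helper name `read_bed_intervals` and the "." placeholder for a missing name column are assumptions of this example; the library's own BED handling may differ.

```python
import io
from typing import TextIO


def read_bed_intervals(handle: TextIO) -> list[tuple[str, int, int, str]]:
    """Parse tab-separated BED lines into (chrom, start, stop, name) tuples."""
    intervals = []
    for line in handle:
        line = line.strip()
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, stop = fields[0], int(fields[1]), int(fields[2])
        # Assumption: use "." when the BED file has no fourth (name) column.
        name = fields[3] if len(fields) > 3 else "."
        intervals.append((chrom, start, stop, name))
    return intervals


# Hypothetical BED content: one named interval and one without a name.
bed_text = io.StringIO("chr1\t0\t1000\tpromoter_a\nchr2\t500\t1500\n")
parsed = read_bed_intervals(bed_text)
```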