End-Motifs#
- class finaletoolkit.frag.EndMotifFreqs(kmer_frequencies: Iterable[tuple[str, float]], k: int, quality_threshold: int = 20)#
Class that stores end-motif k-mer frequencies and contains methods to manipulate this data.
Parameters#
- kmer_frequencies: Iterable
An Iterable of tuples, each containing a str representing a k-mer and a float representing its frequency.
- k: int
Size of k-mers.
- quality_threshold: int, optional
Minimum mapping quality used. Default is 20.
- classmethod from_file(file_path: str | Path, quality_threshold: int, sep: str = '\t', header: int = 0) EndMotifFreqs #
Reads k-mer frequencies from a two-column delimited file (tab-delimited by default).
Parameters#
- file_path: str
Path string containing path to file.
- sep: str, optional
Delimiter used in file.
- header: int, optional
Number of lines to ignore at the head of the file.
Return#
kmer_freqs : EndMotifFreqs
- motif_diversity_score() float #
Calculates a motif diversity score (MDS) using normalized Shannon entropy as described by Jiang et al (2020). This function is generalized for any k instead of just 4-mers.
- to_tsv(output_file: str | Path, sep: str = '\t')#
Writes k-mer frequencies to a TSV file.
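The normalized Shannon entropy behind motif_diversity_score() can be sketched with the standard library alone. This is an illustration of the formula, not the finaletoolkit implementation: the function below is a standalone helper, and rescaling the inputs so they sum to 1 is an assumption made here for robustness.

```python
import math


def motif_diversity_score(freqs: dict[str, float], k: int) -> float:
    """Normalized Shannon entropy of k-mer frequencies (illustrative sketch)."""
    total = sum(freqs.values())
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    entropy = 0.0
    for f in freqs.values():
        p = f / total  # assumption: rescale so the values sum to 1
        if p > 0:  # a term with p == 0 contributes nothing
            entropy -= p * math.log(p)
    # Divide by the maximum possible entropy over 4**k k-mers.
    return entropy / math.log(4 ** k)


# Uniform 2-mer frequencies give the maximum score; a single motif gives 0.
uniform = {a + b: 1 / 16 for a in "ACGT" for b in "ACGT"}
single = {"AA": 1.0}
mds_uniform = motif_diversity_score(uniform, k=2)
mds_single = motif_diversity_score(single, k=2)
```

A score near 1 means the end motifs are spread evenly across all possible k-mers; a score near 0 means a few motifs dominate.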
- class finaletoolkit.frag.EndMotifsIntervals(intervals: Iterable[tuple[tuple, dict]], k: int, quality_threshold: int = 20)#
Class that stores frequencies of end-motif k-mers over user-specified intervals and contains methods to manipulate this data.
Parameters#
- intervals: Iterable
A collection of tuples, each containing a tuple representing a genomic interval (chrom, 0-based start, 1-based stop) and a dict that maps k-mers to frequencies in the interval.
- k: int
Size of k-mers.
- quality_threshold: int, optional
Minimum mapping quality used. Default is 20.
- freq(kmer: str) list[Tuple[str, int, int, float]] #
Returns a list of intervals and the associated frequency for the given k-mer. Results are in the form (chrom, 0-based start, 1-based stop, frequency).
- classmethod from_file(file_path: str, quality_threshold: int, sep: str = ',') EndMotifFreqs #
Reads k-mer frequencies from a delimited file (the delimiter is set by sep). Expected columns are contig, start, stop, name, count, *kmers. Because exporting to file includes an option to convert counts to fractions, reading an exported file does not necessarily reproduce the original data exactly.
Parameters#
- file_path: str
Path string containing path to file.
- quality_threshold: int
MAPQ filter used. Only used for some calculations.
- sep: str, optional
Delimiter used in file.
Return#
kmer_freqs : EndMotifFreqs
- mds_bed(output_file: str | Path, sep: str = '\t')#
Writes MDS for each interval to a bed/bedgraph file.
- motif_diversity_score() list[tuple[tuple, float]] #
Calculates a motif diversity score (MDS) for each interval using normalized Shannon entropy as described by Jiang et al (2020). This function is generalized for any k instead of just 4-mers.
- to_bed(kmer: str, output_file: str | Path, calc_freq: bool = True, sep: str = '\t')#
Writes the frequency of the specified k-mer in each interval to a BED file.
Parameters#
- output_file: str
File to write frequencies to.
- calc_freq: bool, optional
Calculates frequency of motifs if true. Otherwise, writes counts for each motif. Default is true.
- sep: str, optional
Separator for table. Tab-separated by default.
- to_bedgraph(kmer: str, output_file: str | Path, calc_freq: bool = True, sep: str = '\t')#
Writes the frequency of the specified k-mer in each interval to a bedGraph file.
Parameters#
- output_file: str
File to write frequencies to.
- calc_freq: bool, optional
Calculates frequency of motifs if true. Otherwise, writes counts for each motif. Default is true.
- sep: str, optional
Separator for table. Tab-separated by default.
- to_tsv(output_file: str | Path, calc_freq: bool = True, sep: str = '\t')#
Writes all intervals and associated frequencies to file. Columns are contig, start, stop, name, count, *kmers.
Parameters#
- output_file: str
File to write frequencies to.
- calc_freq: bool, optional
Calculates frequency of motifs if true. Otherwise, writes counts for each motif. Default is true.
- sep: str, optional
Separator for table. Tab-separated by default.
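The interval data described for EndMotifsIntervals can be pictured with a small standalone sketch. It mirrors the documented shapes (interval tuple plus k-mer dict in, (chrom, start, stop, frequency) rows out) but is not the library's code. The helper names, the example values, and the choice to report 0.0 for a k-mer missing from an interval are all assumptions.

```python
import io

# Hypothetical data in the documented shape:
# ((chrom, 0-based start, 1-based stop), {kmer: frequency})
intervals = [
    (("chr1", 0, 1000), {"CCCA": 0.02, "AAAA": 0.01}),
    (("chr1", 1000, 2000), {"CCCA": 0.03}),
]


def kmer_freq(intervals, kmer):
    """Return one (chrom, start, stop, frequency) row per interval."""
    return [
        # Assumption: a k-mer absent from an interval is reported as 0.0.
        (chrom, start, stop, freqs.get(kmer, 0.0))
        for (chrom, start, stop), freqs in intervals
    ]


def write_bedgraph(rows, handle, sep="\t"):
    """Write chrom, start, stop, value on one line per interval."""
    for chrom, start, stop, value in rows:
        handle.write(sep.join([chrom, str(start), str(stop), str(value)]) + "\n")


rows = kmer_freq(intervals, "CCCA")
buffer = io.StringIO()
write_bedgraph(rows, buffer)
bedgraph_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch free of file-system side effects; a real output path would be opened with open(path, "w") instead.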
- finaletoolkit.frag.region_end_motifs(input_file: str, contig: str, start: int, stop: int, refseq_file: str | Path, k: int = 4, fraction_low: int = 10, fraction_high: int = 600, both_strands: bool = True, output_file: None | str = None, quality_threshold: int = 20, verbose: bool | int = False) dict #
Function that reads fragments in the specified region from a BAM, SAM, or tabix indexed file and returns the 5’ k-mer (default is 4-mer) end motif counts as a dictionary. This function reproduces the methodology of Zhou et al (2023).
Parameters#
- input_file: str
Path of SAM, BAM, CRAM, or Frag.gz containing paired-end reads.
- contig: str
Name of contig or chromosome for region.
- start: int
0-based start coordinate.
- stop: int
1-based end coordinate.
- refseq_file: str or Path
2bit file with the reference sequence that input_file was aligned to.
- k: int, optional
Length of end motif k-mer. Default is 4.
- fraction_low: int, optional
Minimum fragment length.
- fraction_high: int, optional
Maximum fragment length.
- both_strands: bool, optional
Choose whether to use forward 5’ ends only or use 5’ ends for both ends of PE reads.
- output_file: None or str, optional
- quality_threshold: int, optional
- verbose: bool or int, optional
Return#
end_motif_freq : dict
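What counting 5’ end motifs means can be shown on a toy reference string. The sketch below is only an illustration of the idea and does not reproduce finaletoolkit's internals: fragments are given directly as (start, stop) pairs rather than read from an alignment file, the function names are hypothetical, and skipping k-mers that contain N is an assumption.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_end_motifs(ref, fragments, k=4, both_strands=True):
    """Count 5' k-mers for fragments given as (start, stop) on `ref`.

    Coordinates are 0-based start and 1-based stop, relative to `ref`.
    """
    counts = Counter()
    for start, stop in fragments:
        if stop - start < k:
            continue  # fragment too short to hold a k-mer
        forward = ref[start:start + k].upper()
        if "N" not in forward:  # assumption: ambiguous motifs are skipped
            counts[forward] += 1
        if both_strands:
            # The 5' end of the reverse strand is the reverse complement
            # of the last k bases of the fragment.
            reverse = reverse_complement(ref[stop - k:stop].upper())
            if "N" not in reverse:
                counts[reverse] += 1
    return dict(counts)


ref = "ACGTTGCAAGGC"
counts = count_end_motifs(ref, [(0, 8), (4, 12)], k=4)
forward_only = count_end_motifs(ref, [(0, 8), (4, 12)], k=4, both_strands=False)
```

Fragment-length and mapping-quality filters, which the real function exposes through fraction_low, fraction_high, and quality_threshold, are left out to keep the sketch short.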
- finaletoolkit.frag.end_motifs(input_file: str, refseq_file: str | Path, k: int = 4, fraction_low: int = 10, fraction_high: int = 600, both_strands: bool = False, output_file: None | str = None, quality_threshold: int = 30, workers: int = 1, verbose: bool | int = False) EndMotifFreqs #
Function that reads fragments from a BAM, SAM, or tabix indexed file and returns the 5’ k-mer (default is 4-mer) end motif frequencies as an EndMotifFreqs object. Optionally writes data to a tsv. This function reproduces the methodology of Zhou et al (2023).
Parameters#
- input_file: str
SAM, BAM, CRAM, or Frag.gz file with paired-end reads.
- refseq_file: str or Path
2bit file with sequence of reference genome input_file is aligned to.
- k: int, optional
Length of end motif k-mer. Default is 4.
- output_file: None or str, optional
File path to write results to. Either tsv or csv.
- quality_threshold: int, optional
Minimum MAPQ to filter.
- workers: int, optional
Number of worker processes.
- verbose: bool or int, optional
Return#
end_motif_freq : EndMotifFreqs
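Raw motif counts relate to the (k-mer, frequency) pairs that EndMotifFreqs accepts as follows. This is a minimal sketch under assumptions, not the library's code: counts_to_frequencies is a hypothetical helper, and listing every one of the 4**k possible k-mers (with 0.0 for unobserved ones) is a choice made here for illustration.

```python
from itertools import product


def counts_to_frequencies(counts, k):
    """Turn {kmer: count} into a list of (kmer, frequency) tuples."""
    total = sum(counts.values())
    # Enumerate all 4**k k-mers in lexicographic order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    if total == 0:
        return [(kmer, 0.0) for kmer in kmers]
    return [(kmer, counts.get(kmer, 0) / total) for kmer in kmers]


freqs = counts_to_frequencies({"AC": 3, "GT": 1}, k=2)
freq_by_kmer = dict(freqs)
```

The resulting list has the same shape as the kmer_frequencies argument documented for EndMotifFreqs.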
- finaletoolkit.frag.interval_end_motifs(input_file: str, refseq_file: str | Path, intervals: str | Iterable[Tuple[str, int, int, str]], k: int = 4, fraction_low: int = 10, fraction_high: int = 600, both_strands: bool = True, output_file: None | str = None, quality_threshold: int = 30, workers: int = 1, verbose: bool | int = False) EndMotifsIntervals #
Function that reads fragments from a BAM, SAM, or tabix indexed file and user-specified intervals and returns the 5’ k-mer (default is 4-mer) end motif frequencies for each interval as an EndMotifsIntervals object. Optionally writes data to a tsv.
Parameters#
- input_file: str
Path of SAM, BAM, CRAM, or Frag.gz containing paired-end reads.
- refseq_file: str or Path
Path of 2bit file for reference genome that reads are aligned to.
- intervals: str or Iterable of tuples
Path of BED file containing intervals or list of tuples (chrom, start, stop, name).
- k: int, optional
Length of end motif k-mer. Default is 4.
- output_file: None or str, optional
File path to write results to. Either tsv or csv.
- quality_threshold: int, optional
Minimum MAPQ to filter.
- workers: int, optional
Number of worker processes.
- verbose: bool or int, optional
Return#
end_motif_freq : EndMotifsIntervals
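The intervals argument accepts either a BED path or tuples of (chrom, start, stop, name). One way to build such tuples from BED text is sketched below. read_bed_intervals is a hypothetical helper, not part of finaletoolkit, and using "." as the name when a line has no fourth column is an assumption.

```python
import io


def read_bed_intervals(handle):
    """Parse BED lines into (chrom, start, stop, name) tuples."""
    intervals = []
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, stop = fields[0], int(fields[1]), int(fields[2])
        # Assumption: "." stands in for a missing name column.
        name = fields[3] if len(fields) > 3 else "."
        intervals.append((chrom, start, stop, name))
    return intervals


bed_text = "chr1\t0\t1000\tpromoter_a\nchr2\t500\t1500\n"
intervals = read_bed_intervals(io.StringIO(bed_text))
```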