Genome Utilities
- class finaletoolkit.genome.GenomeGaps(gaps_bed: PathLike | str | None = None)
Reads telomere, centromere, and short_arm intervals from a bed file or generates these intervals from UCSC gap and centromere tracks for hg19 and hg38.
- classmethod b37()
Creates a GenomeGaps for the Broad Institute GRCh37 reference genome, i.e., b37. Like UCSC hg19, this reference genome is based on GRCh37, but it differs from hg19 in a few ways, including the absence of the ‘chr’ prefix. This GenomeGaps is generated with an ad hoc method: the UCSC hg19 gap track is taken and ‘chr’ is dropped from the chromosome names. Because there are other differences between hg19 and b37, this is not a perfect solution.
- Returns:
gaps – GenomeGaps for the b37 reference genome.
- Return type:
GenomeGaps
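The renaming step described for b37 can be illustrated with a minimal sketch. This is not the library's implementation: `strip_chr_prefix` and the tuple layout are hypothetical names chosen for illustration, and the coordinates are placeholders rather than real gap positions.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (contig, start, stop, gap type).
Interval = Tuple[str, int, int, str]


def strip_chr_prefix(intervals: Iterable[Interval]) -> List[Interval]:
    """Return intervals with a leading 'chr' removed from contig names."""
    renamed = []
    for contig, start, stop, name in intervals:
        if contig.startswith("chr"):
            contig = contig[3:]
        renamed.append((contig, start, stop, name))
    return renamed


# Placeholder hg19-style records (coordinates are made up).
hg19_like = [("chr1", 0, 10000, "telomere"), ("chrX", 100, 200, "centromere")]
b37_like = strip_chr_prefix(hg19_like)
```

Only the names change here; any coordinate-level differences between hg19 and b37 are left untouched, which is why the documented approach is described as imperfect.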
- get_arm(contig: str, start: int, stop: int) → str
Returns the chromosome arm the interval is in. If in the short arm of an acrocentric chromosome or intersects a centromere, returns an empty string.
- Parameters:
contig (str) – Chromosome of interval.
start (int) – Start of interval.
stop (int) – End of interval.
- Returns:
arm – Arm that interval is in.
- Return type:
str
- Raises:
ValueError – Raised for invalid coordinates.
- get_contig_gaps(contig: str) → ContigGaps
Creates a ContigGaps for the specified chromosome.
- Parameters:
contig (str) – Chromosome to make ContigGaps for
- Returns:
Contains centromere and telomere intervals for chromosome
- Return type:
ContigGaps
- classmethod hg38()
Creates a GenomeGaps for the hg38 reference genome. This sequence uses chromosome names that start with ‘chr’ and is synonymous with the GRCh38 reference genome.
- Returns:
gaps – GenomeGaps for the hg38 reference genome.
- Return type:
GenomeGaps
- in_tcmere(contig: str, start: int, stop: int) → bool
Checks if specified interval is in a centromere or telomere
- Parameters:
contig (str) – Chromosome name
start (int) – Start of interval
stop (int) – End of interval
- Returns:
in_telomere_or_centromere – True if in a centromere or telomere
- Return type:
bool
- overlaps_gap(contig: str, start: int, stop: int) → bool
Checks if specified interval overlaps a gap interval
- Parameters:
contig (str) – Chromosome name
start (int) – Start of interval
stop (int) – End of interval
- Returns:
in_telomere_or_centromere – True if the interval overlaps a gap interval
- Return type:
bool
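As a rough illustration of what an interval overlap check involves, here is a minimal sketch. It assumes half-open, BED-style coordinates `[start, stop)`; the function names are hypothetical and the library's own method may treat boundaries differently.

```python
from typing import Iterable, Tuple


def overlaps(start: int, stop: int, gap_start: int, gap_stop: int) -> bool:
    """True if [start, stop) and [gap_start, gap_stop) share at least one base."""
    return start < gap_stop and gap_start < stop


def overlaps_any(start: int, stop: int, gaps: Iterable[Tuple[int, int]]) -> bool:
    """True if [start, stop) overlaps any (gap_start, gap_stop) pair in gaps."""
    return any(overlaps(start, stop, g0, g1) for g0, g1 in gaps)
```

Under the half-open assumption, an interval that ends exactly where a gap begins does not count as overlapping.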
- to_bed(output_file: str | PathLike)
Prints gap intervals in GenomeGaps to a BED4 file where the name is the type of gap interval.
- Parameters:
output_file (str or path) – File to write to. Optionally gzipped. If output_file == ‘-’, results will be written to stdout.
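The output behaviour described for to_bed (plain file, gzipped file, or stdout for ‘-’) can be sketched as follows. This is an illustration only: `write_bed4` is a hypothetical helper, and choosing gzip by a `.gz` suffix is an assumption, since the source does not say how the library decides. The example row uses placeholder coordinates.

```python
import gzip
import sys
from typing import Iterable, Tuple

# Hypothetical BED4 record: (contig, start, stop, name), where name is the gap type.
Bed4Row = Tuple[str, int, int, str]


def write_bed4(rows: Iterable[Bed4Row], output_file: str) -> None:
    """Write tab-separated BED4 lines to a file, a .gz file, or stdout ('-')."""
    lines = [
        f"{contig}\t{start}\t{stop}\t{name}\n"
        for contig, start, stop, name in rows
    ]
    if output_file == "-":
        sys.stdout.writelines(lines)
    elif str(output_file).endswith(".gz"):
        # Assumption: gzip output is selected by file suffix.
        with gzip.open(output_file, "wt") as handle:
            handle.writelines(lines)
    else:
        with open(output_file, "w") as handle:
            handle.writelines(lines)


# Placeholder row written to stdout.
write_bed4([("chr1", 0, 10000, "telomere")], "-")
```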
- classmethod ucsc_hg19()
Creates a GenomeGaps for the UCSC hg19 reference genome. This sequence uses chromosome names that start with ‘chr’ and is based on a version of the GRCh37 reference genome.
- Returns:
gaps – GenomeGaps for the UCSC hg19 reference genome.
- Return type:
GenomeGaps
- class finaletoolkit.genome.ContigGaps(contig: str, centromere: tuple[int, int], telomeres: Iterable[tuple[int, int]], has_short_arm: bool = False)
- get_arm(start: int, stop: int)
Returns name of chromosome arm the interval is in. Returns “NOARM” if in a centromere, telomere, or short arm of an acrocentric chromosome.
- Parameters:
start (int) – Start of interval.
stop (int) – End of interval.
- Returns:
Name of the chromosome arm.
- Return type:
str
- Raises:
ValueError – Raised if invalid coordinates are given.
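The arm lookup described above can be sketched in a few lines. This is not the library's code: `classify_arm` is a hypothetical function, the arm labels "p" and "q" are assumed, coordinates are treated as half-open `[start, stop)`, and `has_short_arm` is interpreted as marking an acrocentric chromosome whose short arm is excluded.

```python
from typing import Iterable, Tuple


def classify_arm(
    start: int,
    stop: int,
    centromere: Tuple[int, int],
    telomeres: Iterable[Tuple[int, int]] = (),
    has_short_arm: bool = False,
) -> str:
    """Classify a half-open interval [start, stop) as "p", "q", or "NOARM"."""
    if start < 0 or stop <= start:
        raise ValueError(f"invalid coordinates: start={start}, stop={stop}")
    cen_start, cen_stop = centromere
    # Any overlap with a telomere excludes the interval.
    for tel_start, tel_stop in telomeres:
        if start < tel_stop and tel_start < stop:
            return "NOARM"
    if stop <= cen_start:
        # Entirely before the centromere: the short-arm side.
        return "NOARM" if has_short_arm else "p"
    if start >= cen_stop:
        # Entirely after the centromere: the long arm.
        return "q"
    # Otherwise the interval touches the centromere.
    return "NOARM"
```

An interval that straddles the centromere boundary is reported as “NOARM” rather than being assigned to either arm, which matches the documented behaviour for centromeric intervals.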
- in_gap(start: int, stop: int) → bool
Checks if specified interval is in a gap.
- Parameters:
start (int) – Start of interval.
stop (int) – End of interval.
- Returns:
True if there is an overlap.
- Return type:
bool
- in_tcmere(start: int, stop: int) → bool
Checks if specified interval is in a centromere or telomere.
- Parameters:
start (int) – Start of interval.
stop (int) – End of interval.
- Returns:
True if there is an overlap.
- Return type:
bool
- finaletoolkit.genome.ucsc_hg19_gap_bed(output_file: str | PathLike)
Creates BED4 of centromeres, telomeres, and short arms for the UCSC hg19 reference sequence.
- Parameters:
output_file (str or path) – Output path
- finaletoolkit.genome.b37_gap_bed(output_file: str | PathLike)
Creates BED4 of centromeres, telomeres, and short arms for the Broad Institute GRCh37 (b37) reference sequence. Also useful for files aligned to human_g1k_v37 (1000 Genomes Project).
- Parameters:
output_file (str or path) – Output path
- finaletoolkit.genome.ucsc_hg38_gap_bed(output_file: str | PathLike)
Creates BED4 of centromeres, telomeres, and short arms for the UCSC hg38 reference sequence.
- Parameters:
output_file (str or path) – Output path