Genome Utilities#

class finaletoolkit.genome.GenomeGaps(gaps_bed: Path | str = None)#

Reads telomere, centromere, and short_arm intervals from a bed file or generates these intervals from UCSC gap and centromere tracks for hg19 and hg38.
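To make the expected input concrete, here is a minimal sketch of reading a BED4 gap file into records. It is not the library's parser: `GapRecord` and `parse_gap_bed` are hypothetical names, the coordinates are toy values, and the sketch assumes tab-separated columns with the gap type (`telomere`, `centromere`, or `short_arm`) in the fourth column.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class GapRecord(NamedTuple):
    """One BED4 row: contig, 0-based start, exclusive stop, gap type."""
    contig: str
    start: int
    stop: int
    name: str


def parse_gap_bed(handle: TextIO) -> Iterator[GapRecord]:
    """Yield GapRecord rows from an open BED4 text stream (hypothetical helper)."""
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and the header lines a BED file may carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        contig, start, stop, name = line.split("\t")[:4]
        yield GapRecord(contig, int(start), int(stop), name)


# Toy coordinates for illustration only, not real assembly positions.
example = io.StringIO(
    "track name=gaps\n"
    "chr1\t0\t10000\ttelomere\n"
    "chr1\t50000\t60000\tcentromere\n"
)
records = list(parse_gap_bed(example))
print(records[1].name)  # centromere
```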

classmethod b37()#

Creates a GenomeGaps for the Broad Institute GRCh37 reference genome, i.e., b37. This reference genome is also based on GRCh37 but differs from the UCSC hg19 reference in a few ways, including the absence of the ‘chr’ prefix. This GenomeGaps is generated with an ad hoc method: we take the UCSC hg19 gap track and drop ‘chr’ from the chromosome names. Because there are other differences between hg19 and b37, the result is an approximation rather than an exact gap set for b37.

Returns#

gaps : GenomeGaps

GenomeGaps for the b37 reference genome.
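The ad hoc renaming described for b37 can be sketched as follows. This is an illustration of the idea only, not the library's code: `drop_chr_prefix` is a hypothetical helper and the records are toy values.

```python
from typing import List, Tuple

Record = Tuple[str, int, int, str]  # contig, start, stop, gap type


def drop_chr_prefix(contig: str) -> str:
    """Turn an hg19-style name such as 'chr1' into a b37-style name such as '1'."""
    # Only the prefix is handled here. Other hg19/b37 differences
    # (for example chrM versus MT) are deliberately left alone.
    return contig[3:] if contig.startswith("chr") else contig


# Toy hg19-style gap records, for illustration only.
hg19_style: List[Record] = [
    ("chr1", 0, 10000, "telomere"),
    ("chrX", 0, 10000, "telomere"),
]
b37_style = [(drop_chr_prefix(c), s, e, n) for c, s, e, n in hg19_style]
print(b37_style)
```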

get_arm(contig: str, start: int, stop: int) -> str#

Returns the chromosome arm the interval is in. If in the short arm of an acrocentric chromosome or intersects a centromere, returns an empty string.

Parameters#

contig : str

Chromosome of interval.

start : int

Start of interval.

stop : int

End of interval.

Returns#

arm : str

Arm that interval is in.

Raises#

ValueError

Raised for invalid coordinates

get_contig_gaps(contig: str) -> ContigGaps#

Creates a ContigGaps for the specified chromosome

Parameters#

contig : str

Chromosome to make ContigGaps for

Returns#

ContigGaps

Contains centromere and telomere intervals for chromosome
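One way to picture the per-chromosome view is to filter genome-wide gap records down to a single contig. The sketch below is an assumption about the general idea, not the library's implementation: `ContigGapsSketch` and `collect_contig_gaps` are hypothetical names, with fields that mirror the arguments in the `ContigGaps` signature.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple

Record = Tuple[str, int, int, str]  # contig, start, stop, gap type


@dataclass
class ContigGapsSketch:
    """Hypothetical stand-in holding the gap intervals of one chromosome."""
    contig: str
    centromere: Optional[Tuple[int, int]] = None
    telomeres: List[Tuple[int, int]] = field(default_factory=list)
    has_short_arm: bool = False


def collect_contig_gaps(records: Iterable[Record], contig: str) -> ContigGapsSketch:
    """Gather the gap records that belong to one contig."""
    result = ContigGapsSketch(contig)
    for rec_contig, start, stop, name in records:
        if rec_contig != contig:
            continue
        if name == "centromere":
            result.centromere = (start, stop)
        elif name == "telomere":
            result.telomeres.append((start, stop))
        elif name == "short_arm":
            result.has_short_arm = True
    return result


# Toy records for illustration only.
toy_records: List[Record] = [
    ("chr1", 0, 10000, "telomere"),
    ("chr1", 50000, 60000, "centromere"),
    ("chr1", 90000, 100000, "telomere"),
    ("chr13", 0, 16000, "short_arm"),
]
chr1 = collect_contig_gaps(toy_records, "chr1")
chr13 = collect_contig_gaps(toy_records, "chr13")
print(chr1)
```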

classmethod hg38()#

Creates a GenomeGaps for the hg38 reference genome. This sequence uses chromosome names that start with ‘chr’ and is synonymous with the GRCh38 reference genome.

Returns#

gaps : GenomeGaps

GenomeGaps for the hg38 reference genome.

in_tcmere(contig: str, start: int, stop: int) -> bool#

Checks if specified interval is in a centromere or telomere

Parameters#

contig : str

Chromosome name

start : int

Start of interval

stop : int

End of interval

Returns#

in_telomere_or_centromere : bool

True if in a centromere or telomere

overlaps_gap(contig: str, start: int, stop: int) -> bool#

Checks if specified interval overlaps a gap interval

Parameters#

contig : str

Chromosome name

start : int

Start of interval

stop : int

End of interval

Returns#

bool

True if the interval overlaps a gap interval
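The overlap check itself comes down to a standard interval comparison. The sketch below assumes 0-based, half-open coordinates as in BED files. That convention and the helper names `intervals_overlap` and `overlaps_any` are assumptions made for illustration, not details confirmed by this reference.

```python
from typing import Iterable, Tuple


def intervals_overlap(start: int, stop: int, gap_start: int, gap_stop: int) -> bool:
    """True if half-open [start, stop) shares at least one base with [gap_start, gap_stop)."""
    return start < gap_stop and gap_start < stop


def overlaps_any(start: int, stop: int, gaps: Iterable[Tuple[int, int]]) -> bool:
    """True if the query interval overlaps any interval in gaps."""
    return any(intervals_overlap(start, stop, g0, g1) for g0, g1 in gaps)


# Toy gap intervals for illustration only.
toy_gaps = [(0, 10000), (50000, 60000)]
print(overlaps_any(9990, 10010, toy_gaps))   # True: straddles the first gap's end
print(overlaps_any(10000, 20000, toy_gaps))  # False: starts exactly where the gap ends
```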

to_bed(output_file: str)#

Prints gap intervals in GenomeGaps to a BED4 file where the name is the type of gap interval.

Parameters#

output_file : str

File to write to. Optionally gzipped. If output_file == ‘-’, results will be written to stdout.
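The output behaviour described above can be sketched with the standard library. This is not the library's writer: `write_bed4` is a hypothetical name, and selecting gzip by a `.gz` suffix is an assumption, since the reference only says the file is optionally gzipped.

```python
import gzip
import sys
from typing import Iterable, Tuple

Record = Tuple[str, int, int, str]  # contig, start, stop, gap type


def write_bed4(records: Iterable[Record], output_file: str) -> None:
    """Write tab-separated BED4 rows to a file, a gzipped file, or stdout."""
    lines = (f"{c}\t{s}\t{e}\t{n}\n" for c, s, e, n in records)
    if output_file == "-":
        # '-' means standard output, as in the parameter description.
        sys.stdout.writelines(lines)
    elif output_file.endswith(".gz"):
        # Assumption: a '.gz' suffix selects gzip compression.
        with gzip.open(output_file, "wt") as handle:
            handle.writelines(lines)
    else:
        with open(output_file, "w") as handle:
            handle.writelines(lines)


# Toy records for illustration only.
write_bed4([("chr1", 0, 10000, "telomere"), ("chr1", 50000, 60000, "centromere")], "-")
```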

classmethod ucsc_hg19()#

Creates a GenomeGaps for the UCSC hg19 reference genome. This sequence uses chromosome names that start with ‘chr’ and is based on a version of the GRCh37 reference genome.

Returns#

gaps : GenomeGaps

GenomeGaps for the UCSC hg19 reference genome.
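Once a GenomeGaps has been built with one of the classmethods, its three query methods can be combined to annotate an interval. The sketch below relies only on the method names and signatures documented above. `summarize_interval` and `StubGaps` are hypothetical: the stub exists so the example runs without finaletoolkit installed, and its return values are placeholders rather than real library output.

```python
from typing import Any, Dict


def summarize_interval(gaps: Any, contig: str, start: int, stop: int) -> Dict[str, object]:
    """Collect the results of the three documented interval queries in one dict."""
    return {
        "arm": gaps.get_arm(contig, start, stop),
        "in_tcmere": gaps.in_tcmere(contig, start, stop),
        "overlaps_gap": gaps.overlaps_gap(contig, start, stop),
    }


class StubGaps:
    """Hypothetical stand-in exposing the documented method names."""

    def get_arm(self, contig: str, start: int, stop: int) -> str:
        return "p"  # placeholder label, not the library's actual arm name

    def in_tcmere(self, contig: str, start: int, stop: int) -> bool:
        return False

    def overlaps_gap(self, contig: str, start: int, stop: int) -> bool:
        return False


summary = summarize_interval(StubGaps(), "chr1", 100, 200)
print(summary)
```

With finaletoolkit installed, the object returned by `GenomeGaps.hg38()`, `GenomeGaps.ucsc_hg19()`, or `GenomeGaps.b37()` would be passed in place of the stub.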

class finaletoolkit.genome.ContigGaps(contig: str, centromere: Tuple[int, int], telomeres: Iterable[Tuple[int, int]], has_short_arm: bool = False)#
get_arm(start: int, stop: int)#

Returns name of chromosome arm the interval is in. Returns “NOARM” if in a centromere, telomere, or short arm of an acrocentric chromosome.

Parameters#

start : int

Start of interval.

stop : int

End of interval.

Returns#

str

Name of the chromosome arm.

Raises#

ValueError

Raised if invalid coordinates are given.
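The arm assignment can be sketched as a comparison against the centromere. Everything below is an illustration under stated assumptions, not the library's code: `arm_of` is a hypothetical function, the labels `"p"` and `"q"` are placeholders for whatever arm names the library returns, coordinates are treated as half-open, the p arm is assumed to lie before the centromere, and the `ValueError` condition is a guess at what counts as invalid coordinates. Only the `"NOARM"` value comes from the description above.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]


def arm_of(
    start: int,
    stop: int,
    centromere: Interval,
    telomeres: Iterable[Interval] = (),
    has_short_arm: bool = False,
) -> str:
    """Return 'p', 'q', or 'NOARM' for a half-open interval [start, stop)."""
    # Assumption: negative or empty intervals count as invalid coordinates.
    if start < 0 or stop <= start:
        raise ValueError(f"invalid interval: start={start}, stop={stop}")
    cen_start, _cen_stop = centromere
    blocked = [centromere, *telomeres]
    # Any overlap with the centromere or a telomere means no arm is assigned.
    if any(start < g1 and g0 < stop for g0, g1 in blocked):
        return "NOARM"
    if stop <= cen_start:
        # Before the centromere: the p arm, unless it is an acrocentric short arm.
        return "NOARM" if has_short_arm else "p"
    return "q"


# Toy coordinates for illustration only.
print(arm_of(20000, 21000, centromere=(50000, 60000), telomeres=[(0, 10000)]))  # p
print(arm_of(70000, 71000, centromere=(50000, 60000), telomeres=[(0, 10000)]))  # q
print(arm_of(49000, 51000, centromere=(50000, 60000)))                          # NOARM
```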

in_gap(start: int, stop: int) -> bool#

Checks if specified interval is in a gap.

Parameters#

start : int

Start of interval.

stop : int

End of interval.

Returns#

bool

True if there is an overlap.

in_tcmere(start: int, stop: int) -> bool#

Checks if specified interval is in a centromere or telomere.

Parameters#

start : int

Start of interval.

stop : int

End of interval.

Returns#

bool

True if there is an overlap.

finaletoolkit.genome.ucsc_hg19_gap_bed(output_file: str)#

Creates BED4 of centromeres, telomeres, and short arms for the UCSC hg19 reference sequence.

Parameters#

output_file : str

Output path

finaletoolkit.genome.b37_gap_bed(output_file: str)#

Creates BED4 of centromeres, telomeres, and short arms for the Broad Institute GRCh37 (b37) reference sequence. Also useful for files aligned to human_g1k_v37 (1000 Genomes Project).

Parameters#

output_file : str

Output path

finaletoolkit.genome.ucsc_hg38_gap_bed(output_file: str)#

Creates BED4 of centromeres, telomeres, and short arms for the UCSC hg38 reference sequence.

Parameters#

output_file : str

Output path