Genome Utilities

class finaletoolkit.genome.GenomeGaps(gaps_bed: Path | str | None = None)

Reads telomere, centromere, and short_arm intervals from a BED file, or generates these intervals from the UCSC gap and centromere tracks for hg19 and hg38.

classmethod b37()

Creates a GenomeGaps for the Broad Institute GRCh37 reference genome, i.e., b37. Like UCSC hg19, this reference genome is based on GRCh37, but it differs from hg19 in a few ways, including the absence of the ‘chr’ prefix. We generate this GenomeGaps using an ad hoc method: we take the UCSC hg19 gap track and drop ‘chr’ from the chromosome names. Because there are other differences between hg19 and b37, this is not a perfect solution.

Returns:

gaps – GenomeGaps for the b37 reference genome.

Return type:

GenomeGaps
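The ad hoc conversion described above amounts to removing the ‘chr’ prefix from each contig name. The following is a minimal sketch of that idea only, not the library's implementation; the helper `strip_chr_prefix` and the sample rows are hypothetical and exist purely for illustration.

```python
def strip_chr_prefix(contig: str) -> str:
    """Return the contig name without a leading 'chr' (hypothetical helper)."""
    return contig[3:] if contig.startswith("chr") else contig


# Made-up BED4-style rows: (contig, start, stop, gap type).
hg19_style_rows = [
    ("chr1", 0, 10000, "telomere"),
    ("chrX", 58100000, 61100000, "centromere"),
]

# Rename contigs only; coordinates and gap types are left untouched.
b37_style_rows = [
    (strip_chr_prefix(contig), start, stop, name)
    for contig, start, stop, name in hg19_style_rows
]
print(b37_style_rows)
```

Renaming alone does not reconcile any other differences between the two references, which is why the result is described as imperfect.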

get_arm(contig: str, start: int, stop: int) -> str

Returns the chromosome arm the interval is in. If the interval is in the short arm of an acrocentric chromosome or intersects a centromere, an empty string is returned.

Parameters:
  • contig (str) – Chromosome of interval.

  • start (int) – Start of interval.

  • stop (int) – End of interval.

Returns:

arm – Arm that interval is in.

Return type:

str

Raises:

ValueError – Raised if invalid coordinates are given.

get_contig_gaps(contig: str) -> ContigGaps

Creates a ContigGaps for the specified chromosome

Parameters:

contig (str) – Chromosome to make ContigGaps for

Returns:

Contains centromere and telomere intervals for chromosome

Return type:

ContigGaps

classmethod hg38()

Creates a GenomeGaps for the hg38 reference genome. This sequence uses chromosome names that start with ‘chr’ and is synonymous with the GRCh38 reference genome.

Returns:

gaps – GenomeGaps for the hg38 reference genome.

Return type:

GenomeGaps

in_tcmere(contig: str, start: int, stop: int) -> bool

Checks if specified interval is in a centromere or telomere

Parameters:
  • contig (str) – Chromosome name

  • start (int) – Start of interval

  • stop (int) – End of interval

Returns:

in_telomere_or_centromere – True if in a centromere or telomere

Return type:

bool

overlaps_gap(contig: str, start: int, stop: int) -> bool

Checks if specified interval overlaps a gap interval

Parameters:
  • contig (str) – Chromosome name

  • start (int) – Start of interval

  • stop (int) – End of interval

Returns:

True if the interval overlaps a gap interval

Return type:

bool

to_bed(output_file: str)

Writes the gap intervals in GenomeGaps to a BED4 file in which the name column is the type of gap interval.

Parameters:

output_file (str) – File to write to. Optionally gzipped. If output_file is ‘-’, results will be written to stdout.
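To make the output layout concrete, here is a minimal sketch of writing BED4 rows with the gap type in the name column. It is not the library's implementation: `format_bed4` and `write_bed4` are hypothetical helpers, and choosing gzip compression by a `.gz` suffix is an assumption made for this example.

```python
import gzip
import sys
from collections.abc import Iterable

Bed4Row = tuple[str, int, int, str]  # (contig, start, stop, gap type)


def format_bed4(rows: Iterable[Bed4Row]) -> list[str]:
    """Format rows as tab-separated BED4 lines (hypothetical helper)."""
    return [f"{contig}\t{start}\t{stop}\t{name}\n" for contig, start, stop, name in rows]


def write_bed4(rows: Iterable[Bed4Row], output_file: str) -> None:
    """Write BED4 lines to stdout ('-'), a gzipped file, or a plain file."""
    lines = format_bed4(rows)
    if output_file == "-":
        sys.stdout.writelines(lines)
    elif output_file.endswith(".gz"):  # assumption: suffix selects gzip
        with gzip.open(output_file, "wt") as out:
            out.writelines(lines)
    else:
        with open(output_file, "w") as out:
            out.writelines(lines)


# Made-up intervals, written to stdout.
write_bed4([("chr1", 0, 10000, "telomere")], "-")
```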

classmethod ucsc_hg19()

Creates a GenomeGaps for the UCSC hg19 reference genome. This sequence uses chromosome names that start with ‘chr’ and is based on a version of the GRCh37 reference genome.

Returns:

gaps – GenomeGaps for the UCSC hg19 reference genome.

Return type:

GenomeGaps

class finaletoolkit.genome.ContigGaps(contig: str, centromere: tuple[int, int], telomeres: Iterable[tuple[int, int]], has_short_arm: bool = False)

get_arm(start: int, stop: int)

Returns name of chromosome arm the interval is in. Returns “NOARM” if in a centromere, telomere, or short arm of an acrocentric chromosome.

Parameters:
  • start (int) – Start of interval.

  • stop (int) – End of interval.

Returns:

Name of the chromosome arm.

Return type:

str

Raises:

ValueError – Raised if invalid coordinates are given.
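The arm lookup can be pictured as comparing the interval against the centromere. The sketch below is a simplified illustration, not the library's code: `arm_of` is a hypothetical function, the "p" and "q" labels are placeholders for whatever arm names the library returns, telomeres and short arms are ignored, and "invalid coordinates" is assumed to mean a negative start or a stop smaller than the start.

```python
def arm_of(start: int, stop: int, centromere: tuple[int, int]) -> str:
    """Classify a half-open interval relative to a centromere (hypothetical)."""
    if start < 0 or stop < start:
        # Assumed definition of invalid coordinates.
        raise ValueError(f"invalid interval: {start}-{stop}")
    cen_start, cen_stop = centromere
    if stop <= cen_start:
        return "p"  # entirely before the centromere
    if start >= cen_stop:
        return "q"  # entirely after the centromere
    return "NOARM"  # touches the centromere


# Made-up centromere coordinates for illustration.
centromere = (1000, 2000)
print(arm_of(100, 200, centromere))
print(arm_of(2500, 2600, centromere))
print(arm_of(900, 1100, centromere))
```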

in_gap(start: int, stop: int) -> bool

Checks if specified interval is in a gap.

Parameters:
  • start (int) – Start of interval.

  • stop (int) – End of interval.

Returns:

True if there is an overlap.

Return type:

bool
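An overlap test of this kind is commonly written as a comparison of interval endpoints. The sketch below assumes 0-based, half-open coordinates as used in BED files; `intervals_overlap` and `in_any_gap` are hypothetical names, and this is not necessarily how the library performs the check.

```python
from collections.abc import Iterable


def intervals_overlap(start: int, stop: int, gap_start: int, gap_stop: int) -> bool:
    """True if half-open intervals [start, stop) and [gap_start, gap_stop) share a base."""
    return start < gap_stop and gap_start < stop


def in_any_gap(start: int, stop: int, gaps: Iterable[tuple[int, int]]) -> bool:
    """True if the interval overlaps at least one gap (hypothetical helper)."""
    return any(intervals_overlap(start, stop, g0, g1) for g0, g1 in gaps)


# Made-up gap intervals for illustration.
gaps = [(0, 10000), (121500000, 124500000)]
print(in_any_gap(9990, 10050, gaps))
print(in_any_gap(10000, 10100, gaps))
```

With half-open coordinates, an interval that starts exactly where a gap ends does not count as overlapping.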

in_tcmere(start: int, stop: int) -> bool

Checks if specified interval is in a centromere or telomere.

Parameters:
  • start (int) – Start of interval.

  • stop (int) – End of interval.

Returns:

True if there is an overlap.

Return type:

bool

finaletoolkit.genome.ucsc_hg19_gap_bed(output_file: str)

Creates BED4 of centromeres, telomeres, and short arms for the UCSC hg19 reference sequence.

Parameters:

output_file (str) – Output path

finaletoolkit.genome.b37_gap_bed(output_file: str)

Creates BED4 of centromeres, telomeres, and short arms for the Broad Institute GRCh37 (b37) reference sequence. Also useful for files aligned to human_g1k_v37 (1000 Genomes Project).

Parameters:

output_file (str) – Output path

finaletoolkit.genome.ucsc_hg38_gap_bed(output_file: str)

Creates BED4 of centromeres, telomeres, and short arms for the UCSC hg38 reference sequence.

Parameters:

output_file (str) – Output path