Generic API¶

hictkpy.is_cooler(path: str | PathLike) → bool¶: Test whether path points to a cooler file.

hictkpy.is_mcool_file(path: str | PathLike) → bool¶: Test whether path points to a .mcool file.

hictkpy.is_scool_file(path: str | PathLike) → bool¶: Test whether path points to a .scool file.

hictkpy.is_hic(path: str | PathLike) → bool¶: Test whether path points to a .hic file.
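The four predicates above can be combined into a single format check. The following is a minimal sketch: the helper name detect_format and its return labels are hypothetical and not part of hictkpy, and the order in which the predicates are tried is an arbitrary choice.

```python
from pathlib import Path


def detect_format(path):
    """Return "hic", "mcool", "scool", "cool", or None when no predicate matches."""
    # hictkpy is imported inside the function so that this sketch can be
    # loaded even on systems where hictkpy is not installed.
    import hictkpy

    path = Path(path)
    checks = (
        ("hic", hictkpy.is_hic),
        ("mcool", hictkpy.is_mcool_file),
        ("scool", hictkpy.is_scool_file),
        ("cool", hictkpy.is_cooler),
    )
    for label, predicate in checks:
        if predicate(path):
            return label
    return None
```

How the predicates behave for paths that do not exist is not stated above, so callers may want to check for existence first.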

class hictkpy.MultiResFile(*args, **kwargs)¶

Class representing a file handle to a .hic or .mcool file.

__init__(self, path: str | PathLike) → None¶: Open a multi-resolution Cooler file (.mcool) or .hic file.

__getitem__(self, arg: int, /) → File¶: Open the Cooler or .hic file corresponding to the resolution given as input.

__enter__(self) → MultiResFile¶

__exit__( self, exc_type: object | None = None, exc_value: object | None = None, traceback: object | None = None, ) → None¶

attributes(self) → dict¶: Get file attributes as a dictionary.

chromosomes(self, include_ALL: bool = False) → dict[str, int]¶: Get the chromosome sizes as a dictionary mapping names to sizes.

close(self) → None¶: Manually close the file handle.

is_hic(self) → bool¶: Test whether the file is in .hic format.

is_mcool(self) → bool¶: Test whether the file is in .mcool format.

path(self) → Path¶: Get the file path.

resolutions( self, ) → numpy.ndarray[dtype=int64, shape=(*), order='C']¶: Get the list of available resolutions.
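As a rough illustration of how resolutions() and __getitem__ fit together, the sketch below opens the available resolution closest to a requested bin size. The helper name open_closest_resolution and its target parameter are hypothetical; the function relies only on the two MultiResFile methods documented above.

```python
def open_closest_resolution(mrf, target):
    """Open the resolution of `mrf` that is closest to `target` (in bp).

    `mrf` is expected to behave like hictkpy.MultiResFile.
    """
    # resolutions() returns a NumPy array: convert its items to plain ints,
    # since __getitem__ is documented as taking an int.
    resolutions = [int(r) for r in mrf.resolutions()]
    if not resolutions:
        raise ValueError("file does not contain any resolution")
    closest = min(resolutions, key=lambda r: abs(r - target))
    return mrf[closest]
```

A possible call, assuming a file named file.mcool exists, is open_closest_resolution(mrf, 25_000) inside a "with hictkpy.MultiResFile("file.mcool") as mrf:" block.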

class hictkpy.File(*args, **kwargs)¶

Class representing a file handle to a .cool or .hic file.

__init__( self, path: str | PathLike, resolution: int | None = None, matrix_type: str = 'observed', matrix_unit: str = 'BP', ) → None¶: Construct a file object to a .hic, .cool or .mcool file given the file path and resolution. Resolution is ignored when opening single-resolution Cooler files.

__enter__(self) → File¶

__exit__( self, exc_type: object | None = None, exc_value: object | None = None, traceback: object | None = None, ) → None¶

attributes(self) → dict¶: Get file attributes as a dictionary.

avail_normalizations(self) → list[str]¶: Get the list of available normalizations.

bins(self) → BinTable¶: Get table of bins.

chromosomes(self, include_ALL: bool = False) → dict[str, int]¶: Get chromosome sizes as a dictionary mapping names to sizes.

close(self) → None¶: Manually close the file handle.

fetch( self, range1: str | None = None, range2: str | None = None, normalization: str | None = None, count_type: type | str = 'int32', join: bool = False, query_type: str = 'UCSC', diagonal_band_width: int | None = None, ) → PixelSelector¶: Fetch interactions overlapping a region of interest.

has_normalization(self, normalization: str) → bool¶: Check whether a given normalization is available.
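One way to combine has_normalization() with fetch() is to fall back to raw counts when the requested weights are missing. This is only a sketch: fetch_with_fallback is a hypothetical helper, not part of hictkpy, and it leaves count_type at its default.

```python
def fetch_with_fallback(f, region, normalization):
    """Fetch `region` from `f`, balanced with `normalization` when available.

    `f` is expected to behave like hictkpy.File. Returns a tuple with the
    selector and the name of the normalization actually used (None for raw
    counts).
    """
    if f.has_normalization(normalization):
        return f.fetch(region, normalization=normalization), normalization
    # The requested weights are missing: fall back to raw interaction counts.
    return f.fetch(region), None
```

Returning the normalization name alongside the selector lets the caller tell balanced and raw results apart without inspecting the data.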

is_cooler(self) → bool¶: Test whether the file is in .cool format.

is_hic(self) → bool¶: Test whether the file is in .hic format.

nbins(self) → int¶: Get the total number of bins.

nchroms(self, include_ALL: bool = False) → int¶: Get the total number of chromosomes.

path(self) → Path¶: Return the file path.

resolution(self) → int¶: Get the bin size in bp.

uri(self) → str¶: Return the file URI.

weights( self, name: str, divisive: bool = True, ) → numpy.ndarray[dtype=float64, shape=(*), order='C'] | None¶

weights( self, names: Sequence[str], divisive: bool = True, ) → DataFrame

Overloaded function.

weights(self, name: str, divisive: bool = True) -> numpy.ndarray[dtype=float64, shape=(*), order='C'] | None

Fetch the balancing weights for the given normalization method.

weights(self, names: collections.abc.Sequence[str], divisive: bool = True) -> pandas.DataFrame

Fetch the balancing weights for the given normalization methods. Weights are returned as a pandas.DataFrame.

class hictkpy.PixelSelector(*args, **kwargs)¶

Class representing pixels overlapping with the given genomic intervals.

coord1(self) → tuple[str, int, int] | None¶: Get query coordinates for the first dimension. Returns None when the query spans the entire genome.

coord2(self) → tuple[str, int, int] | None¶: Get query coordinates for the second dimension. Returns None when the query spans the entire genome.

dtype(self) → type¶: Get the dtype for the pixel count.

to_arrow( self, query_span: str = 'upper_triangle', ) → Table¶: Retrieve interactions as a pyarrow.Table.

to_coo( self, query_span: str = 'upper_triangle', low_memory: bool = False, ) → coo_matrix¶: Retrieve interactions as a SciPy COO matrix. When low_memory=True, the heuristic used to minimize the number of memory allocations is turned off, and a two-pass algorithm that allocates a matrix with the exact shape is used instead.

to_csr( self, query_span: str = 'upper_triangle', low_memory: bool = False, ) → csr_matrix¶: Retrieve interactions as a SciPy CSR matrix. When low_memory=True, the heuristic used to minimize the number of memory allocations is turned off, and a two-pass algorithm that allocates a matrix with the exact shape is used instead.

to_df( self, query_span: str = 'upper_triangle', ) → DataFrame¶: Alias to to_pandas().

to_numpy( self, query_span: str = 'full', ) → numpy.ndarray[shape=(*, *), order='C']¶: Retrieve interactions as a numpy 2D matrix.

to_pandas( self, query_span: str = 'upper_triangle', ) → DataFrame¶: Retrieve interactions as a pandas DataFrame.

size(self, upper_triangular: bool = True) → int¶: Get the number of pixels overlapping with the given query.

Statistics

hictkpy.PixelSelector exposes several methods to compute or estimate statistics efficiently.

The main features of these methods are:

All statistics are computed by traversing the data only once and without caching interactions.
All methods can be tweaked to include or exclude non-finite values.
All functions are implemented using short-circuiting to detect scenarios where the required statistics can be computed without traversing all pixels.

The following statistics are guaranteed to be exact:

nnz
sum
min
max
mean

The rest of the supported statistics (currently variance, skewness, and kurtosis) are estimated and are thus not guaranteed to be exact. However, in practice, the estimation is usually very accurate (relative error < 1.0e-6).

You can instruct hictkpy to compute the exact statistics by passing exact=True to hictkpy.PixelSelector.describe() and related methods. It should be noted that for large queries this will result in slower computations and higher memory usage.
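To give an idea of how several statistics can be accumulated while traversing the data only once, here is a small pure-Python sketch based on Welford's online algorithm. It is purely illustrative: the text above does not state which algorithm hictkpy uses, and both the function name single_pass_stats and the choice of the sample variance (division by n - 1) are assumptions.

```python
import math


def single_pass_stats(values, keep_nans=False, keep_infs=False, keep_zeros=False):
    """Compute nnz, sum, min, max, mean and sample variance in one pass."""
    n = 0        # number of values that passed the filters
    nnz = 0      # number of non-zero values among them
    total = 0.0
    minimum = maximum = None
    mean = 0.0   # running mean (Welford's update)
    m2 = 0.0     # running sum of squared deviations from the mean

    for x in values:
        if math.isnan(x) and not keep_nans:
            continue
        if math.isinf(x) and not keep_infs:
            continue
        if x == 0 and not keep_zeros:
            continue

        n += 1
        if x != 0:
            nnz += 1
        total += x
        minimum = x if minimum is None else min(minimum, x)
        maximum = x if maximum is None else max(maximum, x)

        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)

    return {
        "nnz": nnz,
        "sum": total,
        "min": minimum,
        "max": maximum,
        "mean": mean if n > 0 else None,
        "variance": m2 / (n - 1) if n > 1 else None,
    }
```

Nothing is cached between iterations: each value updates a handful of running totals and is then discarded, which is what keeps memory usage independent of the query size.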

describe( self, metrics: Sequence[str] = ['nnz', 'sum', 'min', 'max', 'mean', 'variance', 'skewness', 'kurtosis'], keep_nans: bool = False, keep_infs: bool = False, keep_zeros: bool = False, exact: bool = False, ) → dict¶: Compute one or more descriptive metrics in the most efficient way possible. Known metrics: nnz, sum, min, max, mean, variance, skewness, kurtosis. When a metric cannot be computed (e.g., because metrics=["variance"] but the selector overlaps with a single pixel), the value for that metric is set to None. When keep_infs or keep_nans are set to True, and keep_zeros=True, nan and/or inf values are treated as zeros. By default, metrics are estimated by doing a single pass through the data. The estimates are stable and usually very accurate. However, if you require exact values, you can specify exact=True.
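To check how close the estimates are for a particular query, describe() can be called twice, once with the default settings and once with exact=True. The helper below is a hypothetical sketch (estimation_errors is not part of hictkpy) and assumes that the returned dictionary is keyed by metric name.

```python
def estimation_errors(selector, metrics=("variance", "skewness", "kurtosis")):
    """Return the relative error of each estimated metric w.r.t. its exact value.

    `selector` is expected to behave like hictkpy.PixelSelector.
    """
    metrics = list(metrics)
    estimated = selector.describe(metrics=metrics)
    exact = selector.describe(metrics=metrics, exact=True)

    errors = {}
    for name in metrics:
        est, ref = estimated[name], exact[name]
        if est is None or ref is None or ref == 0:
            # The metric could not be computed, or the relative error is undefined.
            errors[name] = None
        else:
            errors[name] = abs(est - ref) / abs(ref)
    return errors
```

Keep in mind that the exact=True call traverses the data a second time, so this kind of check is best reserved for small queries.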

kurtosis( self, keep_nans: bool = False, keep_infs: bool = False, keep_zeros: bool = False, exact: bool = False, ) → float | None¶: Get the kurtosis of the number of interactions for the current pixel selection. See documentation for describe() for more details.

max( self, keep_nans: bool = False, keep_infs: bool = False, keep_zeros: bool = False, ) → int | float | None¶: Get the maximum number of interactions for the current pixel selection. See documentation for describe() for more details.

mean( self, keep_nans: bool = False, keep_infs: bool = False, keep_zeros: bool = False, ) → float | None¶: Get the average number of interactions for the current pixel selection. See documentation for describe() for more details.

min( self, keep_nans: bool = False, keep_infs: bool = False, keep_zeros: bool = False, ) → int | float | None¶: Get the minimum number of interactions for the current pixel selection. See documentation for describe() for more details.

nnz(self, keep_nans: bool = False, keep_infs: bool = False) → int¶: Get the number of non-zero entries for the current pixel selection. See documentation for describe() for more details.

skewness( self, keep_nans: bool = False, keep_infs: bool = False, keep_zeros: bool = False, exact: bool = False, ) → float | None¶: Get the skewness of the number of interactions for the current pixel selection. See documentation for describe() for more details.

sum( self, keep_nans: bool = False, keep_infs: bool = False, ) → int | float¶: Get the total number of interactions for the current pixel selection. See documentation for describe() for more details.

variance( self, keep_nans: bool = False, keep_infs: bool = False, keep_zeros: bool = False, exact: bool = False, ) → float | None¶: Get the variance of the number of interactions for the current pixel selection. See documentation for describe() for more details.

Iteration

__iter__(self) → hictkpy.PixelIterator¶

Implement iter(self). The resulting iterator yields objects of type hictkpy.Pixel.

In [1]: import hictkpy as htk

In [2]: f = htk.File("file.cool")

In [3]: sel = f.fetch("chr2L:10,000,000-20,000,000")

In [4]: for i, pixel in enumerate(sel):
   ...:     print(pixel.bin1_id, pixel.bin2_id, pixel.count)
   ...:     if i > 10:
   ...:         break
   ...:
1000 1000 6759
1000 1001 3241
1000 1002 760
1000 1003 454
1000 1004 289
1000 1005 674
1000 1006 354
1000 1007 124
1000 1008 130
1000 1009 105
1000 1010 99
1000 1011 120

It is also possible to iterate over pixels together with their genomic coordinates by specifying join=True when calling hictkpy.File.fetch():

In [5]: sel = f.fetch("chr2L:10,000,000-20,000,000", join=True)

In [6]: for i, pixel in enumerate(sel):
   ...:     print(
   ...:         pixel.chrom1, pixel.start1, pixel.end1,
   ...:         pixel.chrom2, pixel.start2, pixel.end2,
   ...:         pixel.count
   ...:     )
   ...:     if i > 10:
   ...:         break
   ...:
chr2L 10000000 10010000 chr2L 10000000 10010000 6759
chr2L 10000000 10010000 chr2L 10010000 10020000 3241
chr2L 10000000 10010000 chr2L 10020000 10030000 760
chr2L 10000000 10010000 chr2L 10030000 10040000 454
chr2L 10000000 10010000 chr2L 10040000 10050000 289
chr2L 10000000 10010000 chr2L 10050000 10060000 674
chr2L 10000000 10010000 chr2L 10060000 10070000 354
chr2L 10000000 10010000 chr2L 10070000 10080000 124
chr2L 10000000 10010000 chr2L 10080000 10090000 130
chr2L 10000000 10010000 chr2L 10090000 10100000 105
chr2L 10000000 10010000 chr2L 10100000 10110000 99
chr2L 10000000 10010000 chr2L 10110000 10120000 120
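Since a PixelSelector is a regular Python iterable, the enumerate-and-break pattern used above can also be written with itertools.islice. The helper name first_pixels is hypothetical; it only uses the bin1_id, bin2_id and count properties shown in the examples.

```python
from itertools import islice


def first_pixels(selector, n):
    """Return the first `n` pixels of `selector` as (bin1_id, bin2_id, count) tuples."""
    # islice stops consuming the underlying iterator after n items, so no
    # explicit counter or break statement is needed.
    return [(p.bin1_id, p.bin2_id, p.count) for p in islice(selector, n)]
```

With the selector from the first example, first_pixels(sel, 12) would collect the same twelve pixels that the loop prints.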

class hictkpy.Bin¶

Class representing a genomic Bin (i.e., a BED interval).

property id¶: Get the bin ID.

property rel_id¶: Get the relative bin ID (i.e., the ID that uniquely identifies a bin within a chromosome).

property chrom¶: Get the name of the chromosome to which the Bin refers.

property start¶: Get the Bin start position.

property end¶: Get the Bin end position.

class hictkpy.BinTable(*args, **kwargs)¶

Class representing a table of genomic bins.

__init__(self, chroms: dict[str, int], resolution: int) → None¶

__init__(self, bins: DataFrame) → None

Overloaded function.

__init__(self, chroms: dict[str, int], resolution: int) -> None

Construct a table of bins given a dictionary mapping chromosomes to their sizes and a resolution.

__init__(self, bins: pandas.DataFrame) -> None

Construct a table of bins from a pandas.DataFrame with columns [“chrom”, “start”, “end”].

chromosomes(self, include_ALL: bool = False) → dict[str, int]¶: Get the chromosome sizes as a dictionary mapping names to sizes.

get(self, bin_id: int) → Bin¶

get(self, bin_ids: Sequence[int]) → DataFrame

get(self, chrom: str, pos: int) → Bin

get( self, chroms: Sequence[str], pos: Sequence[int], ) → DataFrame

Overloaded function.

get(self, bin_id: int) -> hictkpy.Bin

Get the genomic coordinate given a bin ID.

get(self, bin_ids: collections.abc.Sequence[int]) -> pandas.DataFrame

Get the genomic coordinates given a sequence of bin IDs. Genomic coordinates are returned as a pandas.DataFrame with columns [“chrom”, “start”, “end”].

get(self, chrom: str, pos: int) -> hictkpy.Bin

Get the bin overlapping the given genomic coordinate.

get(self, chroms: collections.abc.Sequence[str], pos: collections.abc.Sequence[int]) -> pandas.DataFrame

Get the bins overlapping the given genomic coordinates. Bins are returned as a pandas.DataFrame with columns [“chrom”, “start”, “end”].

get_id(self, chrom: str, pos: int) → int¶: Get the ID of the bin overlapping the given genomic coordinate.

get_ids( self, chroms: Sequence[str], pos: Sequence[int], ) → numpy.ndarray[dtype=int64, shape=(*)]¶: Get the IDs of the bins overlapping the given genomic coordinates.
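For tables with a fixed bin size, the mapping from a genomic coordinate to a bin ID can be modeled with simple arithmetic. The sketch below is a simplified model of the conventional layout, where bins are numbered consecutively in chromosome order; it is not hictkpy's implementation, fixed_bin_id is a hypothetical name, and it does not cover tables with variable bin size.

```python
def fixed_bin_id(chroms, resolution, chrom, pos):
    """Map a genomic position to a bin ID for a table with fixed-size bins.

    `chroms` maps chromosome names to sizes; its insertion order is assumed
    to be the order in which chromosomes appear in the bin table.
    """
    if chrom not in chroms:
        raise KeyError(f"unknown chromosome: {chrom}")
    if not 0 <= pos < chroms[chrom]:
        raise ValueError(f"position {pos} is outside of {chrom}")

    offset = 0
    for name, size in chroms.items():
        if name == chrom:
            # pos // resolution is the bin index within the chromosome
            return offset + pos // resolution
        # every preceding chromosome contributes ceil(size / resolution) bins
        offset += -(-size // resolution)
```

For example, with chroms = {"chr2L": 23_513_712, "chr2R": 25_286_936} (sizes chosen for illustration) and a resolution of 10,000 bp, position 10,000,000 on chr2L maps to bin 1000, and the first bin of chr2R has ID 2352.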

merge(self, df: DataFrame) → DataFrame¶: Merge genomic coordinates corresponding to the given bin identifiers. Bin identifiers should be provided as a pandas.DataFrame with columns “bin1_id” and “bin2_id”. Genomic coordinates are returned as a pandas.DataFrame containing the same data as the DataFrame given as input, plus columns [“chrom1”, “start1”, “end1”, “chrom2”, “start2”, “end2”].

resolution(self) → int¶: Get the bin size for the bin table. Return 0 in case the bin table has a variable bin size.

to_arrow( self, range: str | None = None, query_type: str = 'UCSC', ) → Table¶: Return the bins in the BinTable as a pyarrow.Table. The optional “range” parameter can be used to only fetch a subset of the bins in the BinTable.

to_df( self, range: str | None = None, query_type: str = 'UCSC', ) → DataFrame¶: Alias to to_pandas().

to_pandas( self, range: str | None = None, query_type: str = 'UCSC', ) → DataFrame¶: Return the bins in the BinTable as a pandas.DataFrame. The optional “range” parameter can be used to only fetch a subset of the bins in the BinTable.

type(self) → str¶: Get the type of table underlying the BinTable object (i.e., fixed or variable).

__iter__(self) → hictkpy.BinTableIterator¶: Implement iter(self). The resulting iterator yields objects of type hictkpy.Bin.

class hictkpy.Pixel(*args, **kwargs)¶

Class modeling a Pixel in COO or BG2 format.

property bin1_id¶: Get the ID of bin1.

property bin2_id¶: Get the ID of bin2.

property count¶: Get the number of interactions.

The following properties are only available when pixels are in BG2 format.

property bin1¶: Get bin1.

property bin2¶: Get bin2.

property chrom1¶: Get the chromosome associated with bin1.

property start1¶: Get the start position associated with bin1.

property end1¶: Get the end position associated with bin1.

property chrom2¶: Get the chromosome associated with bin2.

property start2¶: Get the start position associated with bin2.

property end2¶: Get the end position associated with bin2.