Generic API
- class hictkpy.MultiResFile(*args, **kwargs)¶
Class representing a file handle to a .hic or .mcool file
- __init__(self, path: str | PathLike) None¶
Open a multi-resolution Cooler file (.mcool) or .hic file.
- __getitem__(self, arg: int, /) File¶
Open the Cooler or .hic file corresponding to the resolution given as input.
- __enter__(self) MultiResFile¶
- __exit__(self, exc_type: object | None = None, exc_value: object | None = None, traceback: object | None = None)
- chromosomes(self, include_ALL: bool = False) dict[str, int]¶
Get the chromosome sizes as a dictionary mapping names to sizes.
- resolutions(self)
Get the list of available resolutions.
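Since a multi-resolution file only stores a fixed set of resolutions, it can be convenient to pick the closest available one before indexing the file. The helper below is a minimal sketch of that idea; `pick_resolution` is a hypothetical name and not part of hictkpy. It only assumes that the available resolutions can be iterated and converted to `int`.

```python
def pick_resolution(available, target):
    """Return the available resolution closest to target.

    Hypothetical helper (not part of hictkpy). Ties are broken in favour
    of the smaller, i.e. higher-detail, resolution.
    """
    candidates = [int(r) for r in available]
    if not candidates:
        raise ValueError("no resolutions available")
    return min(candidates, key=lambda r: (abs(r - target), r))
```

Assuming `mrf` is an open hictkpy.MultiResFile, the result could then be used as `mrf[pick_resolution(mrf.resolutions(), 25_000)]`.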
- class hictkpy.File(*args, **kwargs)¶
Class representing a file handle to a .cool or .hic file.
- __init__(self, path: str | PathLike, resolution: int | None = None, matrix_type: str = 'observed', matrix_unit: str = 'BP')
Construct a file object to a .hic, .cool or .mcool file given the file path and resolution. Resolution is ignored when opening single-resolution Cooler files.
- __exit__(self, exc_type: object | None = None, exc_value: object | None = None, traceback: object | None = None)
- chromosomes(self, include_ALL: bool = False) dict[str, int]¶
Get chromosome sizes as a dictionary mapping names to sizes.
- fetch(self, range1: str | None = None, range2: str | None = None, normalization: str | None = None, count_type: type | str = 'int32', join: bool = False, query_type: str = 'UCSC', diagonal_band_width: int | None = None)
Fetch interactions overlapping a region of interest.
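With the default query_type='UCSC', ranges are written as strings such as "chr2L:10,000,000-20,000,000". The sketch below illustrates how a string of this shape can be decomposed into a chromosome name and a pair of positions. It is an illustration only: `parse_ucsc` is a hypothetical helper, and the parser used internally by hictkpy may accept or reject inputs differently.

```python
def parse_ucsc(query):
    """Split a UCSC-style query into (chrom, start, end).

    Hypothetical helper for illustration. A bare chromosome name yields
    (chrom, None, None). Thousands separators (commas) are ignored.
    """
    chrom, sep, span = query.rpartition(":")
    if not sep:
        # No ":" found: treat the whole string as a chromosome name
        return query, None, None
    start_text, dash, end_text = span.partition("-")
    if not chrom or not dash:
        raise ValueError(f"malformed query: {query!r}")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if start >= end:
        raise ValueError(f"empty or inverted interval: {query!r}")
    return chrom, start, end
```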
- weights( ) numpy.ndarray[dtype=float64, shape=(*), order='C'] | None¶
- weights( ) DataFrame
Overloaded function.
weights(self, name: str, divisive: bool = True) -> numpy.ndarray[dtype=float64, shape=(*), order='C'] | None
Fetch the balancing weights for the given normalization method.
weights(self, names: collections.abc.Sequence[str], divisive: bool = True) -> pandas.DataFrame
Fetch the balancing weights for the given normalization methods. Weights are returned as a pandas.DataFrame.
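The divisive flag controls the form in which weights are returned. The sketch below shows how a single raw count could be balanced under the common convention that divisive weights divide the count and multiplicative weights scale it. This convention is an assumption made for illustration, and `balance` is a hypothetical helper rather than a hictkpy function.

```python
def balance(count, weight1, weight2, divisive=True):
    """Balance one interaction count given the weights of its two bins.

    Assumed convention (not taken from the hictkpy documentation):
    divisive weights divide the raw count, multiplicative weights scale it.
    """
    scale = weight1 * weight2
    return count / scale if divisive else count * scale
```

Under this convention, a divisive weight w and a multiplicative weight 1 / w produce the same balanced value.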
- class hictkpy.PixelSelector(*args, **kwargs)¶
Class representing pixels overlapping with the given genomic intervals.
- coord1(self) tuple[str, int, int] | None¶
Get query coordinates for the first dimension. Returns None when query spans the entire genome.
- coord2(self) tuple[str, int, int] | None¶
Get query coordinates for the second dimension. Returns None when query spans the entire genome.
- to_arrow(self, query_span: str = 'upper_triangle')
Retrieve interactions as a pyarrow.Table.
- to_coo( ) coo_matrix¶
Retrieve interactions as a SciPy COO matrix. When low_memory=True, the heuristic used to minimize the number of memory allocations is turned off, and a two-pass algorithm that allocates a matrix with the exact shape is used instead.
- to_csr( ) csr_matrix¶
Retrieve interactions as a SciPy CSR matrix. When low_memory=True, the heuristic used to minimize the number of memory allocations is turned off, and a two-pass algorithm that allocates a matrix with the exact shape is used instead.
- to_numpy(self, query_span: str = 'full')
Retrieve interactions as a numpy 2D matrix.
- to_pandas(self, query_span: str = 'upper_triangle')
Retrieve interactions as a pandas DataFrame.
- size(self, upper_triangular: bool = True) int¶
Get the number of pixels overlapping with the given query.
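to_pandas() and to_arrow() default to query_span='upper_triangle', while to_numpy() defaults to query_span='full'. For a symmetric query (the same range on both dimensions), the two views carry the same information: the full matrix can be rebuilt by mirroring the upper triangle across the diagonal. The sketch below illustrates this with plain Python lists. `mirror_upper_triangle` is a hypothetical helper, and the (row, col, count) triplets are assumed to use offsets relative to the start of the query.

```python
def mirror_upper_triangle(pixels, n):
    """Build a dense n x n symmetric matrix from upper-triangle pixels.

    pixels: iterable of (row, col, count) with row <= col (assumed layout).
    Cells that are not listed are left at zero.
    """
    matrix = [[0] * n for _ in range(n)]
    for row, col, count in pixels:
        if row > col:
            raise ValueError("expected upper-triangle pixels (row <= col)")
        matrix[row][col] = count
        matrix[col][row] = count  # mirror across the diagonal
    return matrix
```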
Statistics
hictkpy.PixelSelector exposes methods to compute or estimate several statistics efficiently. The main features of these methods are:
All statistics are computed by traversing the data only once and without caching interactions.
All methods can be tweaked to include or exclude non-finite values.
All functions are implemented using short-circuiting to detect scenarios where the required statistics can be computed without traversing all pixels.
The following statistics are guaranteed to be exact:
nnz
sum
min
max
mean
The rest of the supported statistics (currently variance, skewness, and kurtosis) are estimated and are thus not guaranteed to be exact. However, in practice, the estimation is usually very accurate (relative error < 1.0e-6).
You can instruct hictkpy to compute the exact statistics by passing exact=True to hictkpy.PixelSelector.describe() and related methods. Note that for large queries this will result in slower computations and higher memory usage.
- describe(self, metrics: Sequence[str] = ['nnz', 'sum', 'min', 'max', 'mean', 'variance', 'skewness', 'kurtosis'], keep_nans: bool = False, keep_infs: bool = False, keep_zeros: bool = False, exact: bool = False)
Compute one or more descriptive metrics in the most efficient way possible. Known metrics: nnz, sum, min, max, mean, variance, skewness, kurtosis. When a metric cannot be computed (e.g. because metrics=["variance"], but selector overlaps with a single pixel), the value for that metric is set to None. When keep_infs or keep_nans are set to True, and keep_zeros=True, nan and/or inf values are treated as zeros. By default, metrics are estimated by doing a single pass through the data. The estimates are stable and usually very accurate. However, if you require exact values, you can specify exact=True.
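To make the idea of single-pass statistics concrete, the sketch below computes nnz, sum, min, max, mean, and the sample variance in one traversal using Welford's online update. This is a generic textbook technique chosen for illustration, not necessarily the algorithm used by hictkpy. `single_pass_stats` is a hypothetical helper that always drops zeros and non-finite values, mirroring the defaults keep_zeros=False, keep_nans=False, and keep_infs=False.

```python
import math


def single_pass_stats(values):
    """Summarize values in a single traversal without storing them.

    Hypothetical helper. Zeros, NaNs and infinities are skipped.
    Metrics that cannot be computed are reported as None.
    """
    n = 0
    total = 0
    lo = hi = None
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        if x == 0 or not math.isfinite(x):
            continue
        n += 1
        total += x
        lo = x if lo is None else min(lo, x)
        hi = x if hi is None else max(hi, x)
        # Welford's update: numerically stable and single-pass
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return {
        "nnz": n,
        "sum": total,
        "min": lo,
        "max": hi,
        "mean": mean if n > 0 else None,
        "variance": m2 / (n - 1) if n > 1 else None,
    }
```

As in describe(), the variance of a single value is reported as None because it is undefined.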
- kurtosis(self, keep_nans: bool = False, keep_infs: bool = False, keep_zeros: bool = False, exact: bool = False)
Get the kurtosis of the number of interactions for the current pixel selection. See documentation for describe() for more details.
- max( ) int | float | None¶
Get the maximum number of interactions for the current pixel selection. See documentation for describe() for more details.
- mean( ) float | None¶
Get the average number of interactions for the current pixel selection. See documentation for describe() for more details.
- min( ) int | float | None¶
Get the minimum number of interactions for the current pixel selection. See documentation for describe() for more details.
- nnz(self, keep_nans: bool = False, keep_infs: bool = False) int¶
Get the number of non-zero entries for the current pixel selection. See documentation for describe() for more details.
- skewness(self, keep_nans: bool = False, keep_infs: bool = False, keep_zeros: bool = False, exact: bool = False)
Get the skewness of the number of interactions for the current pixel selection. See documentation for describe() for more details.
- sum( ) int | float¶
Get the total number of interactions for the current pixel selection. See documentation for describe() for more details.
- variance(self, keep_nans: bool = False, keep_infs: bool = False, keep_zeros: bool = False, exact: bool = False)
Get the variance of the number of interactions for the current pixel selection. See documentation for describe() for more details.
Iteration
- __iter__(self) hictkpy.PixelIterator¶
Implement iter(self). The resulting iterator yields objects of type hictkpy.Pixel.
In [1]: import hictkpy as htk

In [2]: f = htk.File("file.cool")

In [3]: sel = f.fetch("chr2L:10,000,000-20,000,000")

In [4]: for i, pixel in enumerate(sel):
   ...:     print(pixel.bin1_id, pixel.bin2_id, pixel.count)
   ...:     if i > 10:
   ...:         break
   ...:
1000 1000 6759
1000 1001 3241
1000 1002 760
1000 1003 454
1000 1004 289
1000 1005 674
1000 1006 354
1000 1007 124
1000 1008 130
1000 1009 105
1000 1010 99
1000 1011 120
It is also possible to iterate over pixels together with their genomic coordinates by specifying join=True when calling hictkpy.File.fetch():

In [5]: sel = f.fetch("chr2L:10,000,000-20,000,000", join=True)

In [6]: for i, pixel in enumerate(sel):
   ...:     print(
   ...:         pixel.chrom1, pixel.start1, pixel.end1,
   ...:         pixel.chrom2, pixel.start2, pixel.end2,
   ...:         pixel.count
   ...:     )
   ...:     if i > 10:
   ...:         break
   ...:
chr2L 10000000 10010000 chr2L 10000000 10010000 6759
chr2L 10000000 10010000 chr2L 10010000 10020000 3241
chr2L 10000000 10010000 chr2L 10020000 10030000 760
chr2L 10000000 10010000 chr2L 10030000 10040000 454
chr2L 10000000 10010000 chr2L 10040000 10050000 289
chr2L 10000000 10010000 chr2L 10050000 10060000 674
chr2L 10000000 10010000 chr2L 10060000 10070000 354
chr2L 10000000 10010000 chr2L 10070000 10080000 124
chr2L 10000000 10010000 chr2L 10080000 10090000 130
chr2L 10000000 10010000 chr2L 10090000 10100000 105
chr2L 10000000 10010000 chr2L 10100000 10110000 99
chr2L 10000000 10010000 chr2L 10110000 10120000 120
- class hictkpy.Bin¶
Class representing a genomic Bin (i.e., a BED interval).
- property id¶
Get the bin ID.
- property rel_id¶
Get the relative bin ID (i.e., the ID that uniquely identifies a bin within a chromosome).
- property chrom¶
Get the name of the chromosome to which the Bin refers.
- property start¶
Get the Bin start position.
- property end¶
Get the Bin end position.
- class hictkpy.BinTable(*args, **kwargs)¶
Class representing a table of genomic bins.
- __init__(self, chroms: dict[str, int], resolution: int) None¶
- __init__(self, bins: DataFrame) None
Overloaded function.
__init__(self, chroms: dict[str, int], resolution: int) -> None
Construct a table of bins given a dictionary mapping chromosomes to their sizes and a resolution.
__init__(self, bins: pandas.DataFrame) -> None
Construct a table of bins from a pandas.DataFrame with columns [“chrom”, “start”, “end”].
- chromosomes(self, include_ALL: bool = False) dict[str, int]¶
Get the chromosome sizes as a dictionary mapping names to sizes.
- get(self, bin_id: int) Bin¶
- get(self, bin_ids: Sequence[int]) DataFrame
- get(self, chrom: str, pos: int) Bin
- get( ) DataFrame
Overloaded function.
get(self, bin_id: int) -> hictkpy.Bin
Get the genomic coordinate given a bin ID.
get(self, bin_ids: collections.abc.Sequence[int]) -> pandas.DataFrame
Get the genomic coordinates given a sequence of bin IDs. Genomic coordinates are returned as a pandas.DataFrame with columns [“chrom”, “start”, “end”].
get(self, chrom: str, pos: int) -> hictkpy.Bin
Get the bin overlapping the given genomic coordinate.
get(self, chroms: collections.abc.Sequence[str], pos: collections.abc.Sequence[int]) -> pandas.DataFrame
Get the bins overlapping the given genomic coordinates. Bins are returned as a pandas.DataFrame with columns [“chrom”, “start”, “end”].
- get_id(self, chrom: str, pos: int) int¶
Get the ID of the bin overlapping the given genomic coordinate.
- get_ids( ) numpy.ndarray[dtype=int64, shape=(*)]¶
Get the IDs of the bins overlapping the given genomic coordinates.
- merge(self, df: DataFrame) DataFrame¶
Merge genomic coordinates corresponding to the given bin identifiers. Bin identifiers should be provided as a pandas.DataFrame with columns “bin1_id” and “bin2_id”. Genomic coordinates are returned as a pandas.DataFrame containing the same data as the DataFrame given as input, plus columns [“chrom1”, “start1”, “end1”, “chrom2”, “start2”, “end2”].
- resolution(self) int¶
Get the bin size for the bin table. Return 0 in case the bin table has a variable bin size.
- to_arrow( ) Table¶
Return the bins in the BinTable as a pyarrow.Table. The optional “range” parameter can be used to only fetch a subset of the bins in the BinTable.
- to_pandas( ) DataFrame¶
Return the bins in the BinTable as a pandas.DataFrame. The optional “range” parameter can be used to only fetch a subset of the bins in the BinTable.
- __iter__(self) hictkpy.BinTableIterator¶
Implement iter(self). The resulting iterator yields objects of type hictkpy.Bin.
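The relationship between bin IDs, relative bin IDs, and genomic coordinates can be illustrated with a toy table of fixed-size bins. The sketch below is not the hictkpy implementation: `FixedBinTable` is a hypothetical class, and it assumes that bins are numbered consecutively in chromosome order and that the last bin of each chromosome is truncated at the chromosome end.

```python
from bisect import bisect_right


class FixedBinTable:
    """Toy bin table with a fixed bin size (illustration only)."""

    def __init__(self, chroms, resolution):
        self.chroms = dict(chroms)
        self.resolution = resolution
        self.names = list(self.chroms)
        # offsets[i] is the ID of the first bin of the i-th chromosome
        self.offsets = []
        next_id = 0
        for size in self.chroms.values():
            self.offsets.append(next_id)
            next_id += -(-size // resolution)  # ceiling division
        self.num_bins = next_id

    def get_id(self, chrom, pos):
        """Return the ID of the bin overlapping chrom:pos."""
        if not 0 <= pos < self.chroms[chrom]:
            raise ValueError(f"position {pos} is outside {chrom}")
        return self.offsets[self.names.index(chrom)] + pos // self.resolution

    def get(self, bin_id):
        """Return (chrom, start, end, rel_id) for the given bin ID."""
        if not 0 <= bin_id < self.num_bins:
            raise IndexError(f"bin ID {bin_id} is out of range")
        i = bisect_right(self.offsets, bin_id) - 1
        chrom = self.names[i]
        rel_id = bin_id - self.offsets[i]
        start = rel_id * self.resolution
        end = min(start + self.resolution, self.chroms[chrom])
        return chrom, start, end, rel_id
```

The chromosome names and sizes passed to this class are arbitrary. With a 10,000 bp resolution and a first chromosome of at least 10,000,000 bp, position 10,000,000 falls in bin 1000, which is consistent with the bin IDs printed in the iteration example for chr2L.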
- class hictkpy.Pixel(*args, **kwargs)¶
Class modeling a Pixel in COO or BG2 format.
- property bin1_id¶
Get the ID of bin1.
- property bin2_id¶
Get the ID of bin2.
- property count¶
Get the number of interactions.
The following properties are only available when pixels are in BG2 format.
- property bin1¶
Get bin1.
- property bin2¶
Get bin2.
- property chrom1¶
Get the chromosome associated with bin1.
- property start1¶
Get the start position associated with bin1.
- property end1¶
Get the end position associated with bin1.
- property chrom2¶
Get the chromosome associated with bin2.
- property start2¶
Get the start position associated with bin2.
- property end2¶
Get the end position associated with bin2.