Quickstart

hictkpy provides Python bindings for hictk through nanobind.

hictkpy.File can open .cool and .hic files and can be used to fetch interactions as well as file metadata.

The examples in this section use the file 4DNFIOTPSS3L.hic, which can be downloaded from the 4D Nucleome Data Portal.

Opening files

In [1]: import hictkpy as htk

# Open the file at 10 kbp resolution (.mcool and .cool files are also supported)
In [2]: f = htk.File("4DNFIOTPSS3L.hic", 10_000)

In [3]: f.path()
Out[3]: '4DNFIOTPSS3L.hic'

Important

The above example assigns the hictkpy.File directly to the variable f for simplicity. When opening files with hictkpy, prefer a context manager (i.e., the with statement), so that the file is closed as soon as the block exits:

with htk.File("4DNFIOTPSS3L.hic", 10_000) as f:
    # use the file inside the block, e.g. query its resolution
    print(f.resolution())

Reading file metadata

In [4]: f.resolution()
Out[4]: 10000

In [5]: f.chromosomes()
Out[5]:
{'2L': 23513712,
 '2R': 25286936,
 '3L': 28110227,
 '3R': 32079331,
 '4': 1348131,
 'X': 23542271,
 'Y': 3667352}

In [6]: f.attributes()
Out[6]:
{'bin_size': 10000,
 'format': 'HIC',
 'format_version': 8,
 'assembly': '/var/lib/cwl/stgb25a903a-ebb6-4a56-bf3f-90bd84a40bf4/4DNFIBEEN92C.chrom.sizes',
 'format-url': 'https://github.com/aidenlab/hic-format',
 'nbins': 13758,
 'nchroms': 7}
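
The nbins and nchroms attributes can be cross-checked against the chromosome sizes and the bin size. The sketch below assumes that each chromosome is tiled independently with fixed-size bins and that the last bin of a chromosome may be shorter than the others (this agrees with the bin table shown later in this section, where the last bin of Y ends at 3667352). The variable names are illustrative and not part of hictkpy.

```python
# Chromosome sizes and bin size copied from the outputs above
chromosomes = {
    "2L": 23513712,
    "2R": 25286936,
    "3L": 28110227,
    "3R": 32079331,
    "4": 1348131,
    "X": 23542271,
    "Y": 3667352,
}
bin_size = 10_000

# Ceiling division: a partially covered trailing bin still counts as one bin
bins_per_chrom = {
    name: (size + bin_size - 1) // bin_size for name, size in chromosomes.items()
}
nbins = sum(bins_per_chrom.values())

print(len(chromosomes))  # 7, matches 'nchroms'
print(nbins)             # 13758, matches 'nbins'
```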

Fetch interactions

Interactions are fetched by calling the hictkpy.File.fetch() method.

hictkpy.File.fetch() returns hictkpy.PixelSelector objects, which are very cheap to create.

# Fetch all interactions (genome-wide query) in COO format (row, column, count)
In [7]: sel = f.fetch()

# Fetch all interactions (genome-wide query) in bedgraph2 format
In [8]: sel = f.fetch(join=True)

# Fetch KR-normalized interactions
In [9]: sel = f.fetch(normalization="KR")

# Fetch interactions for a region of interest
In [10]: sel = f.fetch("2L:10,000,000-20,000,000")

# Fetch interactions between a region on 2L and chromosome X
In [11]: sel = f.fetch("2L:10,000,000-20,000,000", "X")

In [12]: sel.nnz()
Out[12]: 2247057

In [13]: sel.sum()
Out[13]: 7163361

Fetching interactions as pandas DataFrames

In [13]: sel = f.fetch("2L:10,000,000-20,000,000", join=True)

In [14]: sel.to_df()
Out[14]:
       chrom1    start1      end1 chrom2    start2      end2  count
0          2L  10000000  10010000     2L  10000000  10010000   6759
1          2L  10000000  10010000     2L  10010000  10020000   3241
2          2L  10000000  10010000     2L  10020000  10030000    760
3          2L  10000000  10010000     2L  10030000  10040000    454
4          2L  10000000  10010000     2L  10040000  10050000    289
...       ...       ...       ...    ...       ...       ...    ...
339036     2L  19970000  19980000     2L  19980000  19990000    407
339037     2L  19970000  19980000     2L  19990000  20000000    221
339038     2L  19980000  19990000     2L  19980000  19990000    391
339039     2L  19980000  19990000     2L  19990000  20000000    252
339040     2L  19990000  20000000     2L  19990000  20000000    266

[339041 rows x 7 columns]
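
To make the relationship between the COO and bedgraph2 layouts concrete, the sketch below maps a pair of bin IDs to genomic coordinates. It assumes that bin IDs are genome-wide and assigned in chromosome order, as in the bin table shown later in this section. bin_to_interval() and join_pixel() are hypothetical helpers written for illustration; they are not part of hictkpy, and join=True does this work for you.

```python
# Chromosome sizes and bin size copied from the metadata shown earlier
CHROMOSOMES = {
    "2L": 23513712,
    "2R": 25286936,
    "3L": 28110227,
    "3R": 32079331,
    "4": 1348131,
    "X": 23542271,
    "Y": 3667352,
}
BIN_SIZE = 10_000


def bin_to_interval(bin_id, chromosomes=CHROMOSOMES, bin_size=BIN_SIZE):
    """Map a genome-wide bin ID to a (chrom, start, end) tuple."""
    if bin_id < 0:
        raise IndexError(f"bin ID {bin_id} is out of range")
    offset = 0  # ID of the first bin on the current chromosome
    for chrom, size in chromosomes.items():
        num_bins = (size + bin_size - 1) // bin_size
        if bin_id < offset + num_bins:
            start = (bin_id - offset) * bin_size
            # The last bin of a chromosome is truncated at the chromosome end
            return chrom, start, min(start + bin_size, size)
        offset += num_bins
    raise IndexError(f"bin ID {bin_id} is out of range")


def join_pixel(bin1_id, bin2_id, count):
    """Turn a COO pixel into a bedgraph2-style record."""
    return (*bin_to_interval(bin1_id), *bin_to_interval(bin2_id), count)


print(join_pixel(1000, 1001, 3241))
# ('2L', 10000000, 10010000, '2L', 10010000, 10020000, 3241)
```

The printed record has the same coordinates and count as row 1 of the DataFrame above.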

Fetching interactions as scipy.sparse.csr_matrix

In [15]: sel = f.fetch("2L:10,000,000-20,000,000")

In [16]: sel.to_csr()
Out[16]:
<Compressed Sparse Row sparse matrix of dtype 'int32'
        with 339041 stored elements and shape (1000, 1000)>
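
The (1000, 1000) shape follows from the query range and the file resolution. The sketch below shows the arithmetic, assuming the range is treated as a half-open interval and that bins are counted from the start of the chromosome. parse_region() and region_to_bin_range() are hypothetical helpers, not hictkpy functions, and they do not reproduce hictkpy's own query parser.

```python
def parse_region(region):
    """Parse a 'chrom:start-end' string; commas in coordinates are ignored."""
    chrom, _, span = region.partition(":")
    start, _, end = span.replace(",", "").partition("-")
    return chrom, int(start), int(end)


def region_to_bin_range(region, bin_size):
    """Return (chrom, first_bin, last_bin) with last_bin exclusive."""
    chrom, start, end = parse_region(region)
    first_bin = start // bin_size
    last_bin = (end + bin_size - 1) // bin_size  # ceiling division
    return chrom, first_bin, last_bin


chrom, first_bin, last_bin = region_to_bin_range("2L:10,000,000-20,000,000", 10_000)
num_bins = last_bin - first_bin
print(chrom, first_bin, last_bin, num_bins)  # 2L 1000 2000 1000

# Upper bound on distinct pixels in a symmetric num_bins x num_bins matrix
# when only the upper triangle (diagonal included) is stored
max_upper_triangle = num_bins * (num_bins + 1) // 2
print(max_upper_triangle)  # 500500
```

The 339041 stored elements reported above are below this bound and equal the number of rows in the DataFrame from the previous example.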

Fetching interactions as numpy.ndarray

In [17]: sel = f.fetch("2L:10,000,000-20,000,000")

In [18]: m = sel.to_numpy()

In [19]: import matplotlib.pyplot as plt

In [20]: from matplotlib.colors import LogNorm

In [21]: plt.imshow(m, norm=LogNorm())

In [22]: plt.show()

Fetching other types of data

Fetching the table of bins as pandas.DataFrame:

In [23]: f.bins()
Out[23]:
      chrom    start      end
0        2L        0    10000
1        2L    10000    20000
2        2L    20000    30000
3        2L    30000    40000
4        2L    40000    50000
...     ...      ...      ...
13753     Y  3620000  3630000
13754     Y  3630000  3640000
13755     Y  3640000  3650000
13756     Y  3650000  3660000
13757     Y  3660000  3667352

[13758 rows x 3 columns]

Fetching balancing weights:

In [24]: import pandas as pd

In [25]: weights = {}
    ...: for norm in f.avail_normalizations():
    ...:     weights[norm] = f.weights(norm)
    ...: weights = pd.DataFrame(weights)
    ...: weights
Out[25]:
             KR        VC   VC_SQRT
0      0.582102  0.666016  0.759389
1      1.300415  1.496604  1.138349
2      1.180977  1.470464  1.128364
3      1.007625  1.266340  1.047122
4      1.175642  1.492664  1.136850
...         ...       ...       ...
13753       NaN  0.000000  0.000000
13754       NaN  0.000000  0.000000
13755       NaN  0.000000  0.000000
13756  1.155544  2.234906  0.631055
13757       NaN  0.069841  0.111556

[13758 rows x 3 columns]
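
Balancing weights are per-bin factors: a normalized count is obtained by combining a raw count with the weights of the two bins it connects. The sketch below illustrates the two common conventions, divisive weights (conventional for .hic normalization vectors) and multiplicative weights (conventional for cooler). Which convention applies to the values returned by f.weights() is an assumption here and should be checked against the API docs. balance() is a hypothetical helper, and the raw count of 100 is made up for illustration.

```python
import math


def balance(count, w1, w2, divisive=True):
    """Apply the weights of two bins to a raw count (hypothetical helper).

    A NaN weight marks a bin without a usable weight, so the result is NaN.
    """
    if math.isnan(w1) or math.isnan(w2):
        return math.nan
    if divisive:
        return count / (w1 * w2)
    return count * w1 * w2


# KR weights of bins 1 and 2 from the table above; the raw count is made up
w1, w2 = 1.300415, 1.180977
print(balance(100, w1, w2, divisive=True))
print(balance(100, w1, w2, divisive=False))

# A pixel touching a bin with a NaN weight cannot be normalized
print(balance(100, w1, math.nan))  # nan
```

In practice there is no need to apply weights by hand: passing normalization="KR" to fetch(), as shown earlier, returns normalized interactions directly.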

Efficiently compute descriptive statistics

hictkpy supports computing common descriptive statistics without reading interactions into memory (and without traversing the data more than once).

Compute all supported statistics at once:

In [26]: f.fetch().describe()
Out[26]:
{'nnz': 18122793,
 'sum': 114355295,
 'min': 1,
 'max': 53908,
 'mean': 6.310025998751958,
 'variance': 9918.666837525623,
 'skewness': 83.28386530442891,
 'kurtosis': 20043.612488253475}
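
To illustrate how statistics can be computed in a single pass without holding the data in memory, the sketch below uses Welford's online algorithm for the mean and variance. This is a generic technique and is not claimed to be the method used by hictk. describe_stream() is a hypothetical helper, and the use of the sample variance (division by n - 1) is an assumption that may not match hictkpy's definition.

```python
def describe_stream(values):
    """Compute basic statistics over an iterable in one pass (Welford)."""
    n = 0
    total = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    lo = hi = None
    for x in values:
        n += 1
        total += x
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        lo = x if lo is None or x < lo else lo
        hi = x if hi is None or x > hi else hi
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return {"nnz": n, "sum": total, "min": lo, "max": hi,
            "mean": mean, "variance": variance}


# A generator is consumed exactly once, so the data is never stored
stats = describe_stream(x for x in [1, 2, 3, 4])
print(stats["nnz"], stats["sum"], stats["min"], stats["max"])  # 4 10 1 4
print(stats["mean"])  # 2.5

# Sanity check on the output above: the reported mean equals sum / nnz
ratio = 114355295 / 18122793
print(round(ratio, 4))  # 6.31
```

The last check shows that the mean reported by describe() is taken over the non-zero pixels only, since it equals 'sum' divided by 'nnz'.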

For more details, please refer to the Statistics section of the API docs for the hictkpy.PixelSelector class.