Creating .cool and .hic files#

hictkpy supports creating .cool and .hic files from pre-binned interactions in COO or BedGraph2 format.

The example use file 4DNFIOTPSS3L.hic, which can be downloaded from here.

Preparation#

The first step consists of converting interactions from 4DNFIOTPSS3L.hic to bedGraph2 format. This can be achieved using hictk dump

user@dev:/tmp$ hictk dump --join 4DNFIOTPSS3L.hic --resolution 50000 > pixels.bg2

user@dev:/tmp$ head pixels.bg2

  2L  0       50000   2L      0       50000   30211
  2L  0       50000   2L      50000   100000  13454
  2L  0       50000   2L      100000  150000  2560
  2L  0       50000   2L      150000  200000  911
  2L  0       50000   2L      200000  250000  753
  2L  0       50000   2L      250000  300000  846
  2L  0       50000   2L      300000  350000  530
  2L  0       50000   2L      350000  400000  378
  2L  0       50000   2L      400000  450000  630
  2L  0       50000   2L      450000  500000  756

Next, we also generate the list of chromosomes.

user@dev:/tmp$ hictk dump -t chroms 4DNFIOTPSS3L.hic > chrom.sizes

user@dev:/tmp$ head chrom.sizes.bg2

  2L  23513712
  2R  25286936
  3L  28110227
  3R  32079331
  4   1348131
  X   23542271
  Y   3667352

Ingesting interactions in a .cool file#

In [1]: import hictkpy as htk

In [2]: import pandas as pd

# Create a dictionary mapping chromosome names to chromosome sizes
In [3]: chroms = pd.read_table("chrom.sizes", names=["name", "length"])
    ...            .set_index("name")["length"]
    ...            .to_dict()

In [4]: chroms
Out[4]:
{'2L': 23513712,
 '2R': 25286936,
 '3L': 28110227,
 '3R': 32079331,
 '4': 1348131,
 'X': 23542271,
 'Y': 3667352}

# Initialize an empty .cool file
In [5]: f = htk.cooler.FileWriter("out.cool", chroms, resolution=50_000)

In [6]: cols = ["chrom1", "start1", "end1",
    ...         "chrom2", "start2", "end2",
    ...         "count"]

# Loop over chunks of interactions and progressively add them to "out.cool"
In [7]: for df in pd.read_table("pixels.bg2", names=cols, chunksize=1_000_000):
  ...:      f.add_pixels(df)
  ...:

# Important! If you forget to call f.finalize() the resulting .cool file will be empty
In [8]: f.finalize()

# Check that the resulting file has some interactions
In [9]: htk.File("out.cool").attributes()["nnz"]
Out[9]: 3118456

Ingesting interactions in a .hic file#

Follow the same step as in the previous section and replace htk.cooler.File with htk.hic.File.

Tips and tricks#

When loading interactions into a .cool or .hic file, interactions are initially stored in a temporary file. When loading a large number of interactions, this temporary file can grow to be quite large. When this is the case, it is wise to pass a custom temporary folder where temporary files will be created:

In [1]: f = htk.cooler.FileWriter("out.cool", chroms, resolution=50_000, tmpdir="/var/tmp/hictk")

When ingesting interactions in a .hic file, performance can be improved by using multiple threads:

In [1]: f = htk.hic.FileWriter("out.hic", chroms, resolution=50_000, n_threads=8)

When memory allows it, it is possible to bypass temporary files by specifying a very large chunk size and ingesting all interactions at once. This can significantly speed up file creation.

# Initialize an empty .cool file

In [1]: cols = ["chrom1", "start1", "end1",
    ...         "chrom2", "start2", "end2",
    ...         "count"]

In [2]: df = pd.read_table("pixels.bg2", names=cols)

In [3]: f = htk.cooler.FileWriter("out.cool", chroms, resolution=50_000, chunk_size=len(df) + 1)

In [4]: f.add_pixels(df)

In [5]: f.finalize()