Creating .cool and .hic files
hictkpy supports creating .cool and .hic files from pre-binned interactions in COO or bedGraph2 format.
The examples in this section use file 4DNFIOTPSS3L.hic, which can be downloaded from the 4D Nucleome Data Portal.
Preparation
The first step involves converting interactions from 4DNFIOTPSS3L.hic to bedGraph2 format.
This can be achieved using hictk dump (or alternatively with hictkpy.File.fetch()).
user@dev:/tmp$ hictk dump --join 4DNFIOTPSS3L.hic --resolution 50000 > pixels.bg2
user@dev:/tmp$ head pixels.bg2
2L 0 50000 2L 0 50000 30211
2L 0 50000 2L 50000 100000 13454
2L 0 50000 2L 100000 150000 2560
2L 0 50000 2L 150000 200000 911
2L 0 50000 2L 200000 250000 753
2L 0 50000 2L 250000 300000 846
2L 0 50000 2L 300000 350000 530
2L 0 50000 2L 350000 400000 378
2L 0 50000 2L 400000 450000 630
2L 0 50000 2L 450000 500000 756
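Before ingesting, it can be useful to check that records like the ones above are well formed. The following is a minimal sketch using only the standard library; `parse_bg2_line` is a hypothetical helper (not part of hictkpy), and it assumes seven whitespace-separated fields per record with bin starts aligned to the resolution.

```python
def parse_bg2_line(line, resolution):
    """Parse one bedGraph2 record and check that it is aligned to the bin size.

    Hypothetical helper for illustration; not part of hictkpy.
    """
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, found {len(fields)}: {line!r}")

    chrom1, start1, end1, chrom2, start2, end2, count = fields
    start1, end1, start2, end2 = (int(x) for x in (start1, end1, start2, end2))

    for start, end in ((start1, end1), (start2, end2)):
        # Bin starts should be multiples of the resolution
        if start % resolution != 0:
            raise ValueError(f"start {start} is not a multiple of {resolution}")
        # Bins are at most one resolution wide
        # (the last bin of a chromosome can be shorter)
        if not 0 < end - start <= resolution:
            raise ValueError(f"invalid bin {start}-{end} for resolution {resolution}")

    return chrom1, start1, end1, chrom2, start2, end2, float(count)


record = parse_bg2_line("2L\t0\t50000\t2L\t50000\t100000\t13454", resolution=50_000)
print(record)
```

Running a check of this kind on the first few thousand lines of pixels.bg2 is usually enough to catch a mismatch between the file and the resolution passed to the writer.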
Next, we also generate the list of chromosomes to use as reference.
user@dev:/tmp$ hictk dump -t chroms 4DNFIOTPSS3L.hic > chrom.sizes
user@dev:/tmp$ head chrom.sizes
2L 23513712
2R 25286936
3L 28110227
3R 32079331
4 1348131
X 23542271
Y 3667352
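As a sanity check on the reference, the chrom.sizes table can also be read without any third-party package, and the number of bins per chromosome derived from it. This is a sketch under the assumption that each line holds a name and a length separated by whitespace; `read_chrom_sizes` and `count_bins` are hypothetical helper names.

```python
def read_chrom_sizes(lines):
    """Build a {name: length} dictionary from the lines of a chrom.sizes file."""
    chroms = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        name, length = line.split()
        chroms[name] = int(length)
    return chroms


def count_bins(chroms, resolution):
    """Number of bins per chromosome: ceil(length / resolution), in integer math."""
    return {
        name: (length + resolution - 1) // resolution
        for name, length in chroms.items()
    }


# With a file on disk one would write:
#     with open("chrom.sizes") as f:
#         chroms = read_chrom_sizes(f)
example_lines = ["2L\t23513712", "2R\t25286936", "4\t1348131"]
chroms = read_chrom_sizes(example_lines)
print(count_bins(chroms, 50_000))
```

Dictionaries preserve insertion order, so the chromosomes stay in the order in which they appear in the file.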
Ingesting interactions in a .cool file
In [1]: import hictkpy as htk
In [2]: import pandas as pd
# Create a dictionary mapping chromosome names to chromosome sizes
In [3]: chroms = (pd.read_table("chrom.sizes", names=["name", "length"])
...:           .set_index("name")["length"]
...:           .to_dict())
In [4]: chroms
Out[4]:
{'2L': 23513712,
'2R': 25286936,
'3L': 28110227,
'3R': 32079331,
'4': 1348131,
'X': 23542271,
'Y': 3667352}
# Define the name of the columns for later use
In [5]: cols = ["chrom1", "start1", "end1",
...: "chrom2", "start2", "end2",
...: "count"]
# Initialize an empty .cool file
In [6]: with htk.cooler.FileWriter("out.cool", chroms, resolution=50_000) as writer:
...:     # Lazily load pixels in chunks to reduce memory usage
...:     pixels = pd.read_table("pixels.bg2", names=cols, chunksize=1_000_000)
...:     # Add chunks of pixels one by one
...:     for i, df in enumerate(pixels):
...:         print(f"adding chunk #{i}...")
...:         writer.add_pixels(df)
...:
adding chunk #0...
adding chunk #1...
adding chunk #2...
adding chunk #3...
# Check that the resulting file has some interactions
In [7]: htk.File("out.cool").attributes()["nnz"]
Out[7]: 3118456
Ingesting interactions in a .hic file
Follow the same steps as above for .cool files, but replace htk.cooler.FileWriter with htk.hic.FileWriter.
Tips and tricks
When loading interactions into a .cool or .hic file, interactions are initially stored in a temporary file. For a large number of interactions, this temporary file can become quite large. In such cases, it may be appropriate to pass a custom temporary folder where these files will be created:
In [1]: f = htk.cooler.FileWriter("out.cool", chroms, resolution=50_000, tmpdir="/var/tmp/hictk")
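Since the temporary file can grow large, it may help to confirm that the chosen folder exists and has enough free space before starting. The sketch below uses only the standard library; `make_tmpdir` and the free-space threshold are hypothetical and should be adapted to the dataset at hand.

```python
import shutil
import tempfile
from pathlib import Path


def make_tmpdir(parent, min_free_bytes):
    """Create a fresh temporary folder under `parent` after checking free space.

    Hypothetical helper for illustration; not part of hictkpy.
    """
    parent = Path(parent)
    parent.mkdir(parents=True, exist_ok=True)

    free = shutil.disk_usage(parent).free
    if free < min_free_bytes:
        raise RuntimeError(
            f"only {free} bytes free under {parent}, need {min_free_bytes}"
        )

    # mkdtemp creates a uniquely named folder that only the current user can access
    return Path(tempfile.mkdtemp(prefix="hictk-", dir=parent))


# Demo using the system temporary folder and no minimum free space
tmpdir = make_tmpdir(tempfile.gettempdir(), min_free_bytes=0)
print(tmpdir)
# The folder could then be passed to the writer, for example:
#     htk.cooler.FileWriter("out.cool", chroms, resolution=50_000, tmpdir=str(tmpdir))
shutil.rmtree(tmpdir)  # clean up once the output file has been written
```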
When ingesting interactions in a .hic file, performance can be improved by using multiple threads:
In [1]: f = htk.hic.FileWriter("out.hic", chroms, resolution=50_000, n_threads=8)
When memory allows, it is possible to bypass temporary file creation by specifying a very large chunk size and ingesting all interactions at once. This can significantly speed up file creation:
# Define the name of the columns
In [1]: cols = ["chrom1", "start1", "end1",
...: "chrom2", "start2", "end2",
...: "count"]
# Load all interactions in memory
In [2]: df = pd.read_table("pixels.bg2", names=cols)
# Initialize an empty .cool file with a chunk size large enough to hold all interactions
In [3]: with htk.cooler.FileWriter("out.cool", chroms, resolution=50_000, chunk_size=len(df) + 1) as writer:
...:     writer.add_pixels(df)
...:
In case it is not possible to install a compatible version of pandas or pyarrow, the FileWriter classes support ingesting interactions from dictionaries of iterables (e.g., a dictionary mapping keys bin1_id, bin2_id, and count to iterables yielding numbers of the appropriate type). For more details, refer to the documentation for hictkpy.cooler.FileWriter.add_pixels_from_dict() and hictkpy.hic.FileWriter.add_pixels_from_dict().
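As an illustration of what such a dictionary could look like, the sketch below builds bin1_id, bin2_id, and count columns from bedGraph2 lines using only the standard library. It assumes that bin IDs are numbered consecutively across the genome, following the order of the chromosomes in the reference; `bin_offsets` and `bg2_to_coo_columns` are hypothetical helpers, and the exact arguments expected by add_pixels_from_dict() should be checked in the API documentation.

```python
def bin_offsets(chroms, resolution):
    """Map each chromosome to the genome-wide ID of its first bin.

    Assumes bins are numbered consecutively, chromosome by chromosome.
    """
    offsets = {}
    offset = 0
    for name, length in chroms.items():
        offsets[name] = offset
        offset += (length + resolution - 1) // resolution  # ceil division
    return offsets


def bg2_to_coo_columns(lines, chroms, resolution):
    """Convert bedGraph2 lines to a dictionary of COO columns."""
    offsets = bin_offsets(chroms, resolution)
    columns = {"bin1_id": [], "bin2_id": [], "count": []}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        chrom1, start1, _, chrom2, start2, _, count = fields
        columns["bin1_id"].append(offsets[chrom1] + int(start1) // resolution)
        columns["bin2_id"].append(offsets[chrom2] + int(start2) // resolution)
        columns["count"].append(int(count))  # integer counts, as in pixels.bg2
    return columns


chroms = {"2L": 23513712, "2R": 25286936}
lines = [
    "2L\t0\t50000\t2L\t0\t50000\t30211",
    "2L\t0\t50000\t2L\t50000\t100000\t13454",
]
columns = bg2_to_coo_columns(lines, chroms, resolution=50_000)
print(columns)
# The resulting dictionary is the kind of object that could be handed to
# writer.add_pixels_from_dict(); see the API documentation for the details.
```

For large files, the same conversion can be applied to batches of lines so that only one batch is held in memory at a time.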