Learning h5py: Core Concepts

Posted by ryanbutler on Tue, 09 Nov 2021 05:42:12 +0100

Because the pandas to_hdf function hit a bug (TypeError: object of type 'int' has no len()) when writing my DataFrame, I decided to write the data directly using h5py.

The following is translated from https://www.h5py.org/

Core concepts

The h5py package is a Python interface to the HDF5 binary data format.

HDF5 lets you store huge amounts of numerical data and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.

An HDF5 file is a container for two kinds of objects: datasets, which are array-like collections of data, and groups, which are folder-like containers that hold datasets and other groups.

The most fundamental thing to remember when using h5py is:

Groups work like a dictionary, while datasets work like NumPy arrays.

Suppose someone has sent you an HDF5 file, mytestfile.hdf5. (To see how this file was created, read the Appendix: Creating Files.) The first thing you need to do is open the file for reading:

>>> import h5py
>>> f = h5py.File('mytestfile.hdf5', 'r')

The File object is your starting point. What is stored in this file? Remember that h5py.File acts like a Python dictionary, so we can check the keys:

>>> list(f.keys())
['mydataset']

Based on our observation, there is one dataset in the file, mydataset. Let us examine it as a Dataset object:

>>> dset = f['mydataset']

What we got is not an array but an HDF5 dataset. Like NumPy arrays, datasets have both a shape and a data type:

>>> dset.shape
(100,)
>>> dset.dtype
dtype('int32')
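The difference between a Dataset and a NumPy array can be checked directly. The following is a minimal sketch, not part of the original guide: the file name dataset_demo.hdf5 is hypothetical, and driver="core" with backing_store=False is my own choice so that the file lives only in memory.

```python
import h5py
import numpy as np

# Assumption: an in-memory file (core driver, no backing store) so nothing
# is written to disk; the file name is only a label here.
with h5py.File("dataset_demo.hdf5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("mydataset", (100,), dtype="i")

    is_dataset = isinstance(dset, h5py.Dataset)  # an h5py object...
    is_array = isinstance(dset, np.ndarray)      # ...not a NumPy array

    # Array-like metadata is available without reading any data.
    info = (dset.shape, dset.dtype, dset.ndim, dset.size)

    # Indexing with an empty tuple reads the whole dataset into a NumPy array.
    as_array = dset[()]

print(is_dataset, is_array)
print(info)
print(type(as_array).__name__, as_array.shape)
```

Only the indexing step copies data into memory; the Dataset object itself is a handle to data stored in the file.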

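They also support array-style slicing. This is how you read and write data from a dataset in the file. Note that writing requires the file to be open in a writable mode such as r+ or a, not r:

>>> import numpy as np
>>> dset[...] = np.arange(100)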
>>> dset[0]
0
>>> dset[10]
10
>>> dset[0:100:10]
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
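To make the read and write steps reproducible from scratch, here is a small self-contained sketch. It assumes a throwaway file named slicing_demo.hdf5 in a temporary directory (both hypothetical) and writes with the file open in w mode before reopening it read-only.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical scratch file; removed together with the directory.
    path = os.path.join(tmpdir, "slicing_demo.hdf5")

    # Writing needs a writable mode ("w", "r+" or "a").
    with h5py.File(path, "w") as f:
        dset = f.create_dataset("mydataset", (100,), dtype="i")
        dset[...] = np.arange(100)    # write the whole dataset
        dset[0:5] = [9, 8, 7, 6, 5]   # overwrite only the first five elements

    # Reading works in read-only mode; each slice comes back as a NumPy array.
    with h5py.File(path, "r") as f:
        dset = f["mydataset"]
        first_five = dset[0:5]
        every_tenth = dset[0:100:10]
        is_numpy = isinstance(first_five, np.ndarray)

print(first_five)
print(every_tenth)
```

Each slice expression reads only the requested elements, which is what makes this style practical for datasets larger than memory.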

For more information, see File Objects and Datasets.

Appendix: Creating Files

At this point, you may wonder how mytestfile.hdf5 was created. We can create a file by setting the mode to w when the File object is initialized. Other modes are a (for read/write/create access) and r+ (for read/write access).

>>> import h5py
>>> import numpy as np
>>> f = h5py.File("mytestfile.hdf5", "w")

The File object has a couple of interesting methods. One of them is create_dataset, which, as its name suggests, creates a dataset of a given shape and data type:

>>> dset = f.create_dataset("mydataset", (100,), dtype='i')

The File object is a context manager, so the following code works as well:

>>> import h5py
>>> import numpy as np
>>> with h5py.File("mytestfile.hdf5", "w") as f:
...     dset = f.create_dataset("mydataset", (100,), dtype='i')
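The modes and the context manager can be tried together. The sketch below is illustrative only: the file name modes_demo.hdf5 and the temporary directory are assumptions, and it relies on the File.mode property, which reports r for read-only files and r+ for writable ones.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical scratch file used only for this demonstration.
    path = os.path.join(tmpdir, "modes_demo.hdf5")

    # "w" creates the file, truncating any existing file with that name.
    with h5py.File(path, "w") as f:
        f.create_dataset("mydataset", data=np.arange(100, dtype="i"))

    # "r+" opens an existing file for reading and writing.
    with h5py.File(path, "r+") as f:
        f["mydataset"][0] = 42
        mode_rw = f.mode

    # "r" opens an existing file read-only.
    with h5py.File(path, "r") as f:
        mode_ro = f.mode
        first_value = int(f["mydataset"][0])
        shape = f["mydataset"].shape

    # Leaving the with block closes the file; a closed File object is falsy.
    is_closed = not f

print(mode_rw, mode_ro, first_value, shape, is_closed)
```

Using with guarantees the file is closed even if an exception is raised inside the block, which helps avoid leaving a file half written.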

Groups and Hierarchy

"HDF" stands for "Hierarchical Data Format". Every object in an HDF5 file has a name, and the objects are arranged in a POSIX-style hierarchy with / separators:

>>> dset.name
'/mydataset'

The "folders" in this system are called groups. The File object we created is itself a group, in this case the root group, named /:

>>> f.name
'/'

A subgroup is created with the aptly named create_group method. But first we need to open the file in "append" mode (read/write if it exists, create otherwise):

>>> f = h5py.File('mytestfile.hdf5', 'a')
>>> grp = f.create_group("subgroup")

Like the File object, all Group objects have create_* methods:

>>> dset2 = grp.create_dataset("another_dataset", (50,), dtype='f')
>>> dset2.name
'/subgroup/another_dataset'

By the way, you don't have to create all the intermediate groups manually. Specifying the full path is enough, and the missing intermediate groups are created for you:

>>> dset3 = f.create_dataset('subgroup2/dataset_three', (10,), dtype='i')
>>> dset3.name
'/subgroup2/dataset_three'
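The group-creation steps above can be combined into one runnable sketch. The file name hierarchy_demo.hdf5 is hypothetical, and the in-memory core driver is my own choice to avoid creating a file on disk; the group and dataset names follow the examples above.

```python
import h5py

# Assumption: in-memory file (core driver, no backing store).
with h5py.File("hierarchy_demo.hdf5", "w", driver="core", backing_store=False) as f:
    # Explicitly created group, then a dataset inside it.
    grp = f.create_group("subgroup")
    dset2 = grp.create_dataset("another_dataset", (50,), dtype="f")

    # Full path: the intermediate group "subgroup2" is created automatically.
    dset3 = f.create_dataset("subgroup2/dataset_three", (10,), dtype="i")

    names = (grp.name, dset2.name, dset3.name)
    parent_name = dset3.parent.name                        # group holding dset3
    implicit_is_group = isinstance(f["subgroup2"], h5py.Group)

print(names)
print(parent_name, implicit_is_group)
```

The name of every object is its absolute path from the root group, regardless of whether it was created through the file or through a subgroup.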

Groups support most of the Python dictionary-style interface. You retrieve objects in the file using the item-retrieval syntax:

>>> dataset_three = f['subgroup2/dataset_three']

Iterating over a group provides the names of its members:

>>> for name in f:
...     print(name)
mydataset
subgroup
subgroup2

You can also use the name to test if a member exists:

>>> "mydataset" in f
True
>>> "somethingelse" in f
False

You can even use the full path name:

>>> "subgroup/another_dataset" in f
True

There are also the familiar keys(), values(), items() and iter() methods, as well as get().

Since iterating over a group only yields its directly attached members, iterating over an entire file is done with the Group methods visit() and visititems(), which take a callable:
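As a quick illustration of the dictionary-style methods, here is a sketch under the same assumptions as before: a hypothetical in-memory file named dict_demo.hdf5, populated with names borrowed from the examples above.

```python
import h5py

# Assumption: in-memory file (core driver, no backing store).
with h5py.File("dict_demo.hdf5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("mydataset", (100,), dtype="i")
    f.create_dataset("subgroup/another_dataset", (50,), dtype="f")

    # keys() lists only the direct members of the group.
    member_names = sorted(f.keys())

    # items() yields (name, object) pairs; record the kind of each member.
    kinds = {name: type(obj).__name__ for name, obj in f.items()}

    # get() returns None (or a supplied default) instead of raising KeyError.
    missing = f.get("somethingelse")

    # Membership tests accept full paths as well as direct member names.
    has_nested = "subgroup/another_dataset" in f

print(member_names)
print(kinds)
print(missing, has_nested)
```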

>>> def printname(name):
...     print(name)
>>> f.visit(printname)
mydataset
subgroup
subgroup/another_dataset
subgroup2
subgroup2/dataset_three
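visititems() passes both the name and the object to the callable, which makes it easy to pick out only the datasets. The sketch below is an illustration rather than part of the original guide: the helper record_dataset, the dictionary dataset_shapes and the in-memory file name are all hypothetical.

```python
import h5py

dataset_shapes = {}

def record_dataset(name, obj):
    """Hypothetical callback: remember the shape of every dataset visited."""
    if isinstance(obj, h5py.Dataset):
        dataset_shapes[name] = obj.shape
    # Returning None (implicitly) tells h5py to continue the traversal.

# Assumption: in-memory file (core driver, no backing store).
with h5py.File("visit_demo.hdf5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("mydataset", (100,), dtype="i")
    f.create_dataset("subgroup/another_dataset", (50,), dtype="f")
    f.create_dataset("subgroup2/dataset_three", (10,), dtype="i")

    # Walk the whole file recursively, groups and datasets alike.
    f.visititems(record_dataset)

    # visit() stops early and returns the first non-None callback result.
    first_match = f.visit(
        lambda name: name if name.endswith("dataset_three") else None
    )

print(dataset_shapes)
print(first_match)
```

Returning a value other than None from the callable is the way to stop the traversal early, for example once the object being searched for has been found.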

For more information, see Groups.

Attributes

One of the best features of HDF5 is that you can store metadata right next to the data it describes. All groups and datasets support attached named pieces of data called attributes.

Attributes are accessed through the attrs proxy object, which also implements the dictionary interface:

>>> dset.attrs['temperature'] = 99.5
>>> dset.attrs['temperature']
99.5
>>> 'temperature' in dset.attrs
True
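Attributes are not limited to numbers, and groups (including the root group) carry them too. The following sketch shows a few more dictionary-style operations on attrs; the attribute names units, calibration and created_by, and the in-memory file name, are made up for illustration.

```python
import h5py
import numpy as np

# Assumption: in-memory file (core driver, no backing store).
with h5py.File("attrs_demo.hdf5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("mydataset", (100,), dtype="i")

    # Scalars, strings and small arrays can all be stored as attributes.
    dset.attrs["temperature"] = 99.5
    dset.attrs["units"] = "celsius"
    dset.attrs["calibration"] = np.array([1.0, 0.5])

    # The root group (the file itself) supports attributes as well.
    f.attrs["created_by"] = "example script"

    attr_names = sorted(dset.attrs.keys())
    temperature = float(dset.attrs["temperature"])
    units = dset.attrs["units"]
    fallback = dset.attrs.get("pressure", "n/a")  # default for a missing key

    # Attributes can be deleted like dictionary entries.
    del dset.attrs["units"]
    units_removed = "units" not in dset.attrs

print(attr_names)
print(temperature, units, fallback, units_removed)
```

Attributes are meant for small pieces of metadata; larger arrays are usually better stored as datasets of their own.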