Storing an ndarray in a PyTables table (and how to define the Col() type)


TL;DR: I have a PyTables table with a float32 Col and get an error when writing a NumPy float32 array to it. (How) can I store a NumPy array (float32) in a column of a PyTables table?

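I'm new to PyTables. Following the recommendation of TFtables (a library for using HDF5 with TensorFlow), I'm using it to store all my HDF5 data (currently spread in batches over several files, each holding three datasets) in one table inside a single HDF5 file. The datasets are: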

'data' : (n_elements, 1024, 1024, 4)@float32
'label' : (n_elements, 1024, 1024, 1)@uint8
'weights' : (n_elements, 1024, 1024, 1)@float32

The n_elements are spread over several files, which I want to merge into a single file (to allow out-of-order access).

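So when building the table, I figured that each dataset represents one column. I built everything generically, so that it works for an arbitrary number of datasets: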

# get the dtypes (and shapes) of the datasets, accessed by dset_keys = ['data', 'label', 'weights']
dtypes, shapes = _determine_shape(hdf5_files, dset_keys)

# to generate the table dynamically, I'm using a dict (not a class as in the PyTables tutorials)
# the dict has the form given in the docs: { 'col_name' : instance of a Col() subclass }
table_description = {dset_keys[i]: tables.Col.from_dtype(dtypes[i]) for i in range(len(dset_keys))}

# create a file and a group node, then attach a table to it
h5file = tables.open_file(destination_file, mode="w", title="merged")
group = h5file.create_group("/", 'main', 'Node for data table')
table = h5file.create_table(group, 'data_table', table_description, "Collected data with %s" % (str(dset_keys)))

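The dtypes I get for each dataset (read with h5py) are evidently those of the NumPy arrays (ndarray) returned when reading the dataset: float32 or uint8. So the Col() types are Float32Col and UInt8Col. I naively assumed that I could now write a float32 array into such a column, but filling it with data like this: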

dummy_data = np.zeros([1024, 1024, 4], np.float32)  # normally data read from other files; shape matches one 'data' element

sample = table.row
sample['data'] = dummy_data

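results in `TypeError: invalid type (<class 'numpy.ndarray'>) for column ``data```. So now I feel silly for assuming that I could write an array there. But no "ArrayCol()" type is offered. Is there any hint in the PyTables docs on whether or how an array can be written into a column? How do I do this?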

The Col() class and its descendants take a "shape" argument, so it should be possible; otherwise, what would that argument be for?!

1 Answer

Edit: I just saw that tables.Col.from_type(type, shape) lets you keep the precision of the type (float32 instead of just float). The rest stays the same (it takes a string and a shape).

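The factory function tables.Col.from_kind(kind, shape=shape) can be used to construct a Col type that supports ndarrays. What a "kind" is and how to use it is not documented anywhere I could find; but by trial and error I found that the allowed "kind" values are strings naming the basic data types, i.e. 'float', 'uint', ... without the precision (not 'float64').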
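To make the difference between the factory functions concrete, here is a minimal sketch that builds the same (2, 3) column three ways and prints the resulting type and shape. The small shape is only a stand-in for the real element shapes. As far as I can tell from the PyTables API, `from_kind` falls back to the kind's default precision when no item size is given, and `from_dtype` accepts a NumPy sub-array dtype that carries its own shape; treat both as assumptions to check against your installed version.

```python
import numpy as np
import tables

# from_kind without an item size uses the default precision of that kind.
col_kind = tables.Col.from_kind("float", shape=(2, 3))

# from_type takes the full type name, so the precision is explicit.
col_type = tables.Col.from_type("float32", shape=(2, 3))

# from_dtype with a sub-array dtype: base type plus per-element shape.
col_dtype = tables.Col.from_dtype(np.dtype((np.float32, (2, 3))))

for name, col in [("from_kind", col_kind),
                  ("from_type", col_type),
                  ("from_dtype", col_dtype)]:
    print(name, col.type, col.shape)
```

If the precision of the source data matters (for example a uint8 label column), the `from_type` or `from_dtype` variants avoid silently widening the column.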

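Since reading the datasets with h5py gives me numpy.dtypes (dset.dtype), these have to be converted to str and the precision has to be stripped. In the end, the relevant lines look like this: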
```
# get key, dtype and shapes of elements per dataset from the datasource files
val_keys, dtypes, element_shapes = _get_dtypes(datasources, element_axis=element_axis)

# for storing arrays in columns apparently one has to use "kind"
# "kind" cannot be created with dtype but only a string representing 
# the dtype w/o precision, e.g. 'float' or 'uint' 
dtypes_kind = [''.join(i for i in str(dtype) if not i.isdigit()) for dtype in dtypes]

# create table description as dictionary
description = {val_keys[i]: tables.Col.from_kind(dtypes_kind[i], shape=element_shapes[i]) for i in range(len(val_keys))}
```
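The dtype-to-string conversion above can also be written as a small helper that returns both the kind (without precision) and the full type name, so that either `from_kind` or `from_type` can be fed from the same place. This is only a sketch; `kind_and_type` is a hypothetical name, not part of PyTables or NumPy.

```python
import numpy as np

def kind_and_type(dtype):
    """Return (kind, type) strings for a NumPy dtype, e.g. ('float', 'float32')."""
    type_name = np.dtype(dtype).name                # full name, including precision
    kind_name = type_name.rstrip("0123456789")      # drop the trailing bit width
    return kind_name, type_name

print(kind_and_type(np.float32))
print(kind_and_type(np.uint8))
```

Stripping only trailing digits keeps the intent of the original one-liner while making it explicit which part of the name is being removed.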
Writing the data into the table then finally works as intended:
```
sample = table.row
sample[key] = my_array
```
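For completeness, here is a minimal end-to-end sketch of the whole round trip: build the description, append a few rows with array-valued columns, and read one back. It assumes small (8, 8, C) elements in place of the real (1024, 1024, C) ones, uses `from_type` to keep the precision, and writes to a temporary file whose name is made up for the example.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical per-element specs: column name -> (type name, element shape).
element_specs = {
    "data": ("float32", (8, 8, 4)),
    "label": ("uint8", (8, 8, 1)),
    "weights": ("float32", (8, 8, 1)),
}

# One array-valued column per dataset.
description = {
    key: tables.Col.from_type(type_name, shape=shape)
    for key, (type_name, shape) in element_specs.items()
}

path = os.path.join(tempfile.mkdtemp(), "merged_demo.h5")  # made-up file name
with tables.open_file(path, mode="w", title="merged") as h5file:
    group = h5file.create_group("/", "main", "Node for data table")
    table = h5file.create_table(group, "data_table", description, "Collected data")

    sample = table.row
    for i in range(3):
        for key, (type_name, shape) in element_specs.items():
            sample[key] = np.full(shape, i, dtype=type_name)  # dummy content
        sample.append()   # without append() the row is never stored
    table.flush()

# Read the second row back and inspect one of its array cells.
with tables.open_file(path, mode="r") as h5file:
    table = h5file.root.main.data_table
    n_rows = table.nrows
    second = table[1]
    data_shape = second["data"].shape
    data_dtype = second["data"].dtype
    label_value = int(second["label"][0, 0, 0])

print(n_rows, data_shape, data_dtype, label_value)
```

Note the `sample.append()` and `table.flush()` calls: assigning to `table.row` alone only fills the row buffer.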
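Since all of this feels a bit "hacky" and isn't well documented, I still wonder whether this is not the intended use of PyTables. I'll leave the question open to see whether someone knows more...
Answered 2024-05-06T23:21:40+08:00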
