Dictionary of lists to tensor / pandas-dataframe / numpy-array

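I am a beginner with pandas and numpy.

I am working with the dataset mentioned in this paper.

I have several images, each described by certain visual descriptors such as CM, CN, GLRLM (what these descriptors mean is not important). Each visual descriptor is basically a list.

So my data structure is: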

idsDict = {
    12312: {
         "CM": [2, 3, 1, 5, 1],
         "CN" : [1, 4, 5, 1]
    },
    21367: {
         "GLRLM": [9, 4, 1, 4, 5, 12, 67, 12],
         "CM"   : [1, 6, 8, 1, 34]
    }
}

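12312 and 21367 are the image IDs.

I want to convert this into a tensor / numpy array (3D) / pandas DataFrame (3D) so that I can compute distances between images based on their descriptors.

Basically, the tensor / numpy array (3D) / pandas DataFrame (3D) would be a cuboid, with image IDs as rows, descriptors as columns, and the descriptor values along the z-axis.

I have already read: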

Construct pandas DataFrame from items in nested dictionary

Pandas dataframe to dict of dict

1 Answer

In terms of computational speed, you are probably best off using Numpy:

import numpy as np

idsDict = {
    12312: {
      "CM": [2, 3, 1, 5, 1],
      "CN" : [1, 4, 5, 1]
    },
    21367: {
      "GLRLM": [9, 4, 1, 4, 5, 12, 67, 12],
      "CM"   : [1, 6, 8, 1, 34]
    }
}

# loop through once to figure out size of final data structure
dscr = {}
maxlen = 0
for d in idsDict.values():
    for descName,desc in d.items():
        if descName not in dscr:
            dscr[descName] = np.float64  # store every descriptor as float64 (np.obj2sctype no longer exists in NumPy 2.x)
        if len(desc) > maxlen:
            maxlen = len(desc)

# allocate a masked structured array of the right shape and dtype
dtype = np.dtype(sorted(dscr.items()))
_data3d = np.empty((len(idsDict), maxlen), dtype=dtype)
data3d = np.ma.array(_data3d, mask=True)

# copy the data over the array
for d,drow in zip(idsDict.values(), data3d):
    for descName,desc in d.items():
        drow[descName][:len(desc)] = desc

print(data3d.dtype.names,'\n')
print(data3d.T)

Which outputs:

('CM', 'CN', 'GLRLM')

[[(2.0, 1.0, --) (1.0, --, 9.0)]
 [(3.0, 4.0, --) (6.0, --, 4.0)]
 [(1.0, 5.0, --) (8.0, --, 1.0)]
 [(5.0, 1.0, --) (1.0, --, 4.0)]
 [(1.0, --, --) (34.0, --, 5.0)]
 [(--, --, --) (--, --, 12.0)]
 [(--, --, --) (--, --, 67.0)]
 [(--, --, --) (--, --, 12.0)]]
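
Since the question asks for a plain 3D layout in order to compare images, here is a minimal sketch of that variant: a NaN-padded float cube, with the image IDs and descriptor names kept in two parallel lists outside the array. The names `image_ids`, `desc_names`, `cube` and `descriptor_distance` are made up for illustration. The Euclidean metric, and the decision to ignore padded positions, are assumptions, because the question does not say which distance is needed.

```python
import numpy as np

idsDict = {
    12312: {"CM": [2, 3, 1, 5, 1], "CN": [1, 4, 5, 1]},
    21367: {"GLRLM": [9, 4, 1, 4, 5, 12, 67, 12], "CM": [1, 6, 8, 1, 34]},
}

# Fix an order for both label axes so positions can be mapped back to labels.
image_ids = sorted(idsDict)
desc_names = sorted({name for d in idsDict.values() for name in d})
maxlen = max(len(v) for d in idsDict.values() for v in d.values())

# cube[i, j, k] = k-th value of descriptor j for image i; NaN marks padding
# or a descriptor that the image does not have.
cube = np.full((len(image_ids), len(desc_names), maxlen), np.nan)
for i, image_id in enumerate(image_ids):
    for j, name in enumerate(desc_names):
        values = idsDict[image_id].get(name, [])
        cube[i, j, :len(values)] = values


def descriptor_distance(id_a, id_b, name):
    """Euclidean distance over one descriptor (assumed metric).

    Positions that are NaN in either image are ignored; the result is NaN
    when the two images share no valid position for this descriptor.
    """
    j = desc_names.index(name)
    diff = cube[image_ids.index(id_a), j] - cube[image_ids.index(id_b), j]
    valid = ~np.isnan(diff)
    if not valid.any():
        return float("nan")
    return float(np.sqrt(np.sum(diff[valid] ** 2)))


print(cube.shape)
print(descriptor_distance(12312, 21367, "CM"))
```

Only "CM" is present for both images in this sample, so it is the only descriptor with a finite distance; "CN" and "GLRLM" come out as NaN.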

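Unfortunately, there is no good way to store the image IDs inside a Numpy structured array itself. If you need them, you can use Pandas. Here is how to pack all of the data into a single Pandas DataFrame whose first index level is the image ID (a "3D" dataframe):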

import numpy as np
import pandas as pd

idsDict = {
    12312: {
      "CM": [2, 3, 1, 5, 1],
      "CN" : [1, 4, 5, 1]
    },
    21367: {
      "GLRLM": [9, 4, 1, 4, 5, 12, 67, 12],
      "CM"   : [1, 6, 8, 1, 34]
    }
}

# loop through once to figure out size of final data structure
descNames = set()
maxlen = 0
for d in idsDict.values():
    for descName,desc in d.items():
        descNames.add(descName)
        if len(desc) > maxlen:
            maxlen = len(desc)

# pad data
padDesc = maxlen*[np.nan]
for d in idsDict.values():
    for desc in d.values():
        dlen = len(desc)
        if dlen < maxlen:
            desc.extend((maxlen - dlen)*[np.nan])
    for descName in (n for n in descNames if n not in d):
        d[descName] = padDesc

data3d = pd.concat([pd.DataFrame(d) for d in idsDict.values()], keys=idsDict.keys())
print(data3d)

This outputs:

           CM   CN  GLRLM
12312 0   2.0  1.0    NaN
      1   3.0  4.0    NaN
      2   1.0  5.0    NaN
      3   5.0  1.0    NaN
      4   1.0  NaN    NaN
      5   NaN  NaN    NaN
      6   NaN  NaN    NaN
      7   NaN  NaN    NaN
21367 0   1.0  NaN    9.0
      1   6.0  NaN    4.0
      2   8.0  NaN    1.0
      3   1.0  NaN    4.0
      4  34.0  NaN    5.0
      5   NaN  NaN   12.0
      6   NaN  NaN   67.0
      7   NaN  NaN   12.0
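
To get from a frame like this to the per-descriptor distances the question asks about, one possible sketch follows. It builds the frame from `pd.Series` objects so that pandas does the NaN padding during alignment (so it has fewer padded rows than the output above), then subtracts the blocks of two images. The function name `descriptor_distances`, the index level names `image_id` and `pos`, and the Euclidean metric are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

idsDict = {
    12312: {"CM": [2, 3, 1, 5, 1], "CN": [1, 4, 5, 1]},
    21367: {"GLRLM": [9, 4, 1, 4, 5, 12, 67, 12], "CM": [1, 6, 8, 1, 34]},
}

# One DataFrame per image; Series of unequal length are aligned on their
# position index, so shorter descriptors are padded with NaN automatically.
frames = {
    image_id: pd.DataFrame(
        {name: pd.Series(values, dtype=float) for name, values in d.items()}
    )
    for image_id, d in idsDict.items()
}

# Stack the per-image frames under a two-level (image_id, pos) index.
data3d = pd.concat(frames, names=["image_id", "pos"])


def descriptor_distances(df, id_a, id_b):
    """Per-descriptor Euclidean distance between two images (assumed metric).

    NaN positions are skipped; a descriptor missing from either image
    yields NaN because min_count=1 requires at least one valid term.
    """
    diff = df.loc[id_a] - df.loc[id_b]  # aligned on position and column
    return diff.pow(2).sum(min_count=1).pow(0.5)


dist = descriptor_distances(data3d, 12312, 21367)
print(dist)
```

`data3d.loc[12312]` returns the whole block for one image, which is what makes the image ID usable as a lookup key here.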

Answered on 2024-04-29T09:19:34+08:00
