
Creating a lookup table from a DataFrame


I want to create a lookup table from a DataFrame based on multiple conditions. I have the following df:

import numpy as np
import pandas as pd

N = 100
L = ['AR1', 'PO1', 'RU1']

np.random.seed(0)

df3 = pd.DataFrame(
    {'X':np.random.uniform(1,4,N),
     'Y':np.random.uniform(1,4,N),
     'Z':np.random.uniform(1,4,N),
     'LG':np.random.choice(L,N),
    })

# group_keys=False keeps the original row index, so each result aligns with df3 on assignment
df3['bins_X'] = df3.groupby('LG', group_keys=False)['X'].apply(pd.qcut, q=5, labels=np.arange(5))
df3['bins_Y'] = df3.groupby('LG', group_keys=False)['Y'].apply(pd.qcut, q=5, labels=np.arange(5))
df3['bins_Z'] = df3.groupby('LG', group_keys=False)['Z'].apply(pd.qcut, q=5, labels=np.arange(5))
df3['bins_X_int'] = df3.groupby('LG', group_keys=False)['X'].apply(pd.qcut, q=5)
df3['bins_Y_int'] = df3.groupby('LG', group_keys=False)['Y'].apply(pd.qcut, q=5)
df3['bins_Z_int'] = df3.groupby('LG', group_keys=False)['Z'].apply(pd.qcut, q=5)

df3.head()

[image: output of df3.head()]

From it, I want to create the following lookup_table:

[image: the desired lookup_table]

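That is, the rows are grouped by 'LG' and by bin number, sorted from 0 to 4. What I need is the example lookup_table filled in with the associated bin intervals from the DataFrame.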

2 Answers

  • 2

    IIUC:

    def get_ints(s, q):
        return pd.Series(pd.qcut(s, q).sort_values().unique())
    
    d1 = df3.set_index('LG')[list('XYZ')].stack()
    g = d1.groupby(level=[0, 1])
    g.apply(get_ints, q=5).unstack(1).rename_axis(['LG', 'bin_number']).reset_index()
    
         LG  bin_number                X                Y                Z
    0   AR1           0   [1.306, 1.926]  [1.0556, 1.875]  [1.0493, 1.819]
    1   AR1           1   (1.926, 2.447]   (1.875, 2.757]   (1.819, 2.595]
    2   AR1           2   (2.447, 2.812]  (2.757, 3.0724]    (2.595, 2.95]
    3   AR1           3  (2.812, 3.0744]  (3.0724, 3.376]    (2.95, 3.334]
    4   AR1           4  (3.0744, 3.936]   (3.376, 3.803]   (3.334, 3.885]
    5   PO1           0  [1.0564, 1.286]  [1.0955, 1.566]   [1.074, 1.596]
    6   PO1           1   (1.286, 1.868]   (1.566, 1.911]   (1.596, 1.895]
    7   PO1           2   (1.868, 2.682]   (1.911, 2.669]   (1.895, 2.426]
    8   PO1           3    (2.682, 3.29]   (2.669, 2.958]   (2.426, 3.283]
    9   PO1           4    (3.29, 3.965]   (2.958, 3.676]   (3.283, 3.848]
    10  RU1           0  [1.0141, 1.452]  [1.0351, 2.158]  [1.0397, 1.632]
    11  RU1           1   (1.452, 1.983]    (2.158, 2.49]   (1.632, 2.223]
    12  RU1           2   (1.983, 2.622]   (2.49, 3.0893]  (2.223, 3.0732]
    13  RU1           3   (2.622, 3.226]  (3.0893, 3.673]  (3.0732, 3.729]
    14  RU1           4   (3.226, 3.929]   (3.673, 3.997]   (3.729, 3.971]
    
  • 1

    IIUC, you can do it this way:

    In [55]: lkp = df3[['LG']].sort_values('LG').copy()
    
    In [56]: lkp['bin_number'] = lkp.groupby('LG').cumcount()
    
    In [57]: lkp
    Out[57]:
         LG  bin_number
    0   AR1           0
    46  AR1           1
    25  AR1           2
    26  AR1           3
    57  AR1           4
    28  AR1           5
    29  AR1           6
    56  AR1           7
    31  AR1           8
    32  AR1           9
    ..  ...         ...
    45  RU1          24
    98  RU1          25
    55  RU1          26
    58  RU1          27
    60  RU1          28
    61  RU1          29
    63  RU1          30
    64  RU1          31
    39  RU1          32
    99  RU1          33
    
    [100 rows x 2 columns]
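
For comparison, the lookup table the question asks for (one row per 'LG' and bin number, with the X, Y and Z intervals as columns) can also be sketched by reading the interval categories that pd.qcut produces for each group. This is only a minimal sketch, not either answerer's method: the helper name bin_intervals is hypothetical, and the exact interval edges and their printed form may differ from the output shown above depending on the pandas version.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question
N = 100
L = ['AR1', 'PO1', 'RU1']
np.random.seed(0)

df3 = pd.DataFrame({
    'X': np.random.uniform(1, 4, N),
    'Y': np.random.uniform(1, 4, N),
    'Z': np.random.uniform(1, 4, N),
    'LG': np.random.choice(L, N),
})


def bin_intervals(s, q=5):
    """Hypothetical helper: the q quantile intervals of s, lowest first."""
    # The categories of a qcut result are an ordered IntervalIndex
    cats = pd.qcut(s, q).cat.categories
    # Default RangeIndex 0..q-1 serves as the bin number
    return pd.Series(list(cats))


# One Series per column, indexed by (LG, bin position), joined side by side
lookup_table = (
    pd.concat(
        {col: df3.groupby('LG')[col].apply(bin_intervals) for col in ['X', 'Y', 'Z']},
        axis=1,
    )
    .rename_axis(['LG', 'bin_number'])
    .reset_index()
)

print(lookup_table)
```

Reading the categories directly avoids sorting and de-duplicating the binned values, and it still lists a bin even if no row happens to fall into it.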
    
