
Compare Pandas DataFrames and add a column


I have two DataFrames, as shown below:

df1     df2 
A       A   C
A1      A1  C1
A2      A2  C2
A3      A3  C3
A1      A4  C4
A2          
A3          
A4

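Each value in column 'A' is mapped to a value in column 'C' of df2. I want to add a new column 'B' to df1 whose values come from column 'C' of df2.

The final df1 should look like this: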

df1
A   B
A1  C1
A2  C2
A3  C3
A1  C1
A2  C2
A3  C3
A4  C4

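I can iterate over df2 and assign the values to df1, but this is slow because the data is large.

for index, row in df2.iterrows():
    df1.loc[df1.A.isin([row['A']]), 'B'] = row['C']

Can someone help me understand how to solve this without looping over df2?

Thanks.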

3 Answers

  • 1

    IIUC, you can merge and then rename the column:

    df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
    
    In [103]:
    import pandas as pd
    df1 = pd.DataFrame({'A':['A1','A2','A3','A1','A2','A3','A4']})
    df2 = pd.DataFrame({'A':['A1','A2','A3','A4'], 'C':['C1','C2','C3','C4']})
    merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
    merged
    
    Out[103]:
        A   B
    0  A1  C1
    1  A2  C2
    2  A3  C3
    3  A1  C1
    4  A2  C2
    5  A3  C3
    6  A4  C4
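
The merge above silently duplicates rows of df1 if a key appears more than once in df2. A minimal sketch of one way to guard against that, assuming the sample data from the question and using the `validate` argument of `merge`:

```python
import pandas as pd

# Sample data taken from the question
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4']})
df2 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4'],
                    'C': ['C1', 'C2', 'C3', 'C4']})

# validate='many_to_one' makes pandas raise a MergeError
# if the key column 'A' is not unique in df2
merged = (
    df1.merge(df2, on='A', how='left', validate='many_to_one')
       .rename(columns={'C': 'B'})
)

# With a unique right-hand key, a left merge keeps one row per df1 row
assert len(merged) == len(df1)
print(merged)
```

If df2 may legitimately contain repeated keys, calling `df2.drop_duplicates('A')` before merging is one possible alternative, at the cost of picking the first match arbitrarily.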
    
  • 1

    You can use map with a Series:

    df1['B'] = df1.A.map(df2.set_index('A')['C'])
    print (df1)
        A   B
    0  A1  C1
    1  A2  C2
    2  A3  C3
    3  A1  C1
    4  A2  C2
    5  A3  C3
    6  A4  C4
    

    The same using map with a dict:

    d = df2.set_index('A')['C'].to_dict()
    print (d)
    {'A4': 'C4', 'A3': 'C3', 'A2': 'C2', 'A1': 'C1'}
    
    df1['B'] = df1.A.map(d)
    print (df1)
        A   B
    0  A1  C1
    1  A2  C2
    2  A3  C3
    3  A1  C1
    4  A2  C2
    5  A3  C3
    6  A4  C4
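
Both `map` variants above return NaN for any key in df1 that has no entry in df2. The sketch below illustrates that behavior and one way to handle it; the key 'A9' and the placeholder 'unknown' are hypothetical and not part of the original data, and df2.A is assumed to be unique.

```python
import pandas as pd

# 'A9' is a hypothetical key with no counterpart in df2
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A9']})
df2 = pd.DataFrame({'A': ['A1', 'A2'], 'C': ['C1', 'C2']})

# Build the lookup Series once: index = key, values = mapped value
lookup = df2.set_index('A')['C']
df1['B'] = df1['A'].map(lookup)

# Rows whose key was not found in df2 come back as NaN
missing = df1['B'].isna()
print(df1[missing])

# Optionally replace the gaps with a placeholder (hypothetical choice)
df1['B'] = df1['B'].fillna('unknown')
print(df1)
```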
    

    Timings

    len(df1)=7

    In [161]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
    1000 loops, best of 3: 1.73 ms per loop
    
    In [162]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
    The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 873 µs per loop
    

    len(df1)=70k

    In [164]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
    100 loops, best of 3: 12.8 ms per loop
    
    In [165]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
    100 loops, best of 3: 6.05 ms per loop
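
The timings above come from IPython's `%timeit`. For reference, a rough sketch of how a similar comparison could be reproduced in a plain script with the standard `timeit` module. How the 70k-row frame was built is not stated in the answer, so repeating the 7-row sample 10,000 times is an assumption, and the absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

base = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4']})
df2 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4'],
                    'C': ['C1', 'C2', 'C3', 'C4']})

# Assumption: 70k rows obtained by repeating the sample frame
df1 = pd.concat([base] * 10000, ignore_index=True)

def with_merge():
    return df1.merge(df2, on='A', how='left').rename(columns={'C': 'B'})

def with_map():
    return df1['A'].map(df2.set_index('A')['C'])

# Both approaches should produce the same mapped values
assert with_merge()['B'].tolist() == with_map().tolist()

# Best of 3 repeats, 10 calls each, reported per call
t_merge = min(timeit.repeat(with_merge, number=10, repeat=3)) / 10
t_map = min(timeit.repeat(with_map, number=10, repeat=3)) / 10
print(f"merge: {t_merge * 1e3:.2f} ms per call")
print(f"map:   {t_map * 1e3:.2f} ms per call")
```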
    
  • 1

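    Based on the searchsorted method, here are three approaches with different indexing schemes (these assume df2.A is sorted and that every value in df1.A occurs in df2.A):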

    df1['B'] = df2.C[df2.A.searchsorted(df1.A)].values
    df1['B'] = df2.C[df2.A.searchsorted(df1.A)].reset_index(drop=True)
    df1['B'] = df2.C.values[df2.A.searchsorted(df1.A)]
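
`searchsorted` performs a binary search, so it returns an insertion position even for keys that are absent. A defensive sketch under those constraints: it sorts df2 first and verifies each hit. The unsorted df2 and the extra key 'A9' are hypothetical additions for illustration.

```python
import numpy as np
import pandas as pd

# 'A9' is a hypothetical key that does not exist in df2
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A1', 'A2', 'A3', 'A4', 'A9']})
# df2 deliberately given in unsorted order
df2 = pd.DataFrame({'A': ['A3', 'A1', 'A4', 'A2'],
                    'C': ['C3', 'C1', 'C4', 'C2']})

# Binary search requires sorted keys
df2s = df2.sort_values('A').reset_index(drop=True)

# Insertion positions; may equal len(df2s) for keys beyond the last one
pos = df2s['A'].searchsorted(df1['A'])
pos = np.clip(pos, 0, len(df2s) - 1)

# A position is a real match only if the key at that position is equal
found = df2s['A'].to_numpy()[pos] == df1['A'].to_numpy()

# Take the mapped value where matched, leave a missing value otherwise
df1['B'] = np.where(found, df2s['C'].to_numpy()[pos], None)
print(df1)
```

Without the equality check, the row for 'A9' would silently receive the value of a neighboring key.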
    
