
Pandas vectorization: compute the fraction of each group that satisfies a condition


Suppose we have a table of customers and their spending.

import pandas as pd
df = pd.DataFrame({
    "Name":  ["Alice", "Bob", "Bob", "Charles"],
    "Spend": [3, 5, 7, 9]
})
LIMIT = 6

For each customer, we can compute the fraction of their spending above $6 using the apply method:

df.groupby("Name").apply(
    lambda grp: len(grp[grp["Spend"] > LIMIT]) / len(grp)
)

Name
Alice      0.0
Bob        0.5
Charles    1.0

However, the apply method is just a loop, which is slow when there are many customers.
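
Concretely, the apply call above is roughly equivalent to an explicit Python-level loop over the groups, calling the lambda once per customer (a sketch for illustration):

# Iterating a GroupBy yields (name, group) pairs; this loop is what
# apply does under the hood, one Python call per customer.
{name: len(grp[grp["Spend"] > LIMIT]) / len(grp)
 for name, grp in df.groupby("Name")}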

Question: Is there a faster way, which presumably uses vectorization?

The following code produces a null value for Alice:

df[df["Spend"] > LIMIT].groupby("Name").size() / df.groupby("Name").size()

Name
Alice      NaN
Bob        0.5
Charles    1.0
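
The NaN appears because the filtered frame contains no rows for Alice, so her group is missing from the numerator; when the division aligns the two indexes, her entry becomes NaN. The numerator alone shows this:

# Alice never spends more than LIMIT, so her group disappears
# before the division ever happens.
df[df["Spend"] > LIMIT].groupby("Name").size()

Name
Bob        1
Charles    1
dtype: int64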

The following code gives the correct result, but it requires us to modify the table, or to make a copy to avoid modifying the original:

df["Dummy"] = 1 * (df["Spend"] > LIMIT)
df.groupby("Name")["Dummy"].sum() / df.groupby("Name").size()
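
One way to avoid mutating df is assign, which returns a new DataFrame with the helper column and leaves the original untouched (a minimal sketch; tmp is just an illustrative name):

# assign returns a copy with the extra column, so df itself is unchanged.
tmp = df.assign(Dummy=(df["Spend"] > LIMIT).astype(int))
tmp.groupby("Name")["Dummy"].sum() / tmp.groupby("Name").size()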

1 Answer


    Is there a faster way, which presumably uses vectorization?

    Groupby does not use vectorization, but it has cython-optimized functions.


    You can aggregate with mean:

    df1 = df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()
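
    This works because a boolean Series behaves as 0/1 in arithmetic, so the per-group mean is exactly the fraction of rows matching the condition. A minimal illustration on the sample data:

    # gt(LIMIT) gives [False, False, True, True]; True counts as 1,
    # so the mean within each group is the fraction of matching rows.
    mask = df["Spend"].gt(LIMIT)
    mask.groupby(df["Name"]).mean()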
    

    Or use div and fill the missing values with 0:

    df1 = (df[df["Spend"] > LIMIT].groupby("Name").size()
                      .div(df.groupby("Name").size(), fill_value=0))
    

    Or:

    df1 = (df["Spend"].gt(LIMIT).groupby(df["Name"]).sum()
                      .div(df.groupby("Name").size(), fill_value=0))
    

    print (df1)
    Name
    Alice      0.0
    Bob        0.5
    Charles    1.0
    dtype: float64
    

    Performance

    It depends on the total number of rows and on how many rows match the condition, so it is best to test on your real data.

    import numpy as np

    np.random.seed(123)
    
    N = 100000
    df = pd.DataFrame({
        "Name":  np.random.randint(1000, size=N),
        "Spend": np.random.randint(10, size=N)
    })
    LIMIT = 6
    
    In [10]: %timeit df["Spend"].gt(LIMIT).groupby(df["Name"]).mean()
    6.16 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [11]: %timeit df[df["Spend"] > LIMIT].groupby("Name").size().div(df.groupby("Name").size(), fill_value=0)
    6.35 ms ± 95.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [12]: %timeit (df["Spend"].gt(LIMIT).groupby(df["Name"]).sum().div(df.groupby("Name").size(), fill_value=0))
    9.66 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    #RafaelC comment solution
    In [13]: %timeit df.groupby('Name')['Spend'].apply(lambda s: (s>LIMIT).sum()/s.size)
    400 ms ± 27.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [14]: %timeit df.groupby('Name')['Spend'].apply(lambda s: (s>LIMIT).mean())
    328 ms ± 6.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
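
    On this data, the apply-based solutions are roughly 50x slower than the vectorized-style aggregations above.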
    

    A NumPy solution is vectorized, but it is a bit more complicated:

    In [15]: %%timeit
        ...: # Total rows per customer: factorize maps each name to an
        ...: # integer code, bincount counts occurrences of each code.
        ...: i, r = pd.factorize(df['Name'])
        ...: a = pd.Series(np.bincount(i), index=r)
        ...: 
        ...: # Same count, restricted to rows over the limit.
        ...: i1, r1 = pd.factorize(df['Name'].values[df["Spend"].values > LIMIT])
        ...: b = pd.Series(np.bincount(i1), index=r1)
        ...: 
        ...: # Align on name; customers with no matching rows get 0.
        ...: df1 = b.div(a, fill_value=0)
        ...: 
    5.05 ms ± 82.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
