
How do I take a slice of a DataFrame and "fillna" only within that slice using Python Pandas?


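The problem: let's take the Titanic dataset from Kaggle. My DataFrame has the columns "Pclass", "Sex" and "Age". I need to fill the NaN values in the "Age" column with the median of a particular group. For a first-class woman, I want to fill her age with the median age of first-class women, not with the median of the whole "Age" column.

The question is how to make this change within a specific slice?

I tried: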

data['Age'][(data['Sex'] == 'female')&(data['Pclass'] == 1)&(data['Age'].isnull())].fillna(median)

where "median" is my value, but nothing changes. "inplace=True" doesn't help either.

Thanks a lot!
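For context, the attempt above most likely has no effect because selecting with a boolean mask builds a new Series, and `fillna` then returns yet another new object, so nothing is ever written back into `data`. A minimal sketch of that behaviour, assuming a small made-up frame and a hypothetical fill value of 40.0 (neither comes from the Titanic data):

```python
import numpy as np
import pandas as pd

# Small made-up frame with the same columns as in the question.
data = pd.DataFrame({'Sex': ['female', 'female', 'male', 'female'],
                     'Pclass': [1, 1, 1, 2],
                     'Age': [40.0, np.nan, np.nan, 20.0]})
median = 40.0  # hypothetical fill value, standing in for the question's "median"

cond = (data['Sex'] == 'female') & (data['Pclass'] == 1) & (data['Age'].isnull())

# Boolean indexing builds a new Series; fillna then returns yet another one.
filled = data['Age'][cond].fillna(median)

print(filled)                      # the filled value exists only in this new object
print(data['Age'].isnull().sum())  # NaN count in the original frame is unchanged
```

The filled values only reach `data` if they are assigned back to it, for example through `data.loc[...]`.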

2 Answers

  • 2

    I believe you need to filter by a boolean mask and assign the result back:

    import numpy as np
    import pandas as pd

    data = pd.DataFrame({'a':list('aaaddd'),
                         'Sex':['female','female','male','female','female','male'],
                         'Pclass':[1,2,1,2,1,1],
                         'Age':[40,20,30,20,np.nan,np.nan]})
    
    print (data)
        Age  Pclass     Sex  a
    0  40.0       1  female  a
    1  20.0       2  female  a
    2  30.0       1    male  a
    3  20.0       2  female  d
    4   NaN       1  female  d
    5   NaN       1    male  d
    
    #boolean mask
    mask1 = (data['Sex'] == 'female')&(data['Pclass'] == 1)
    
    #get median by mask without NaNs
    med = data.loc[mask1, 'Age'].median()
    print (med)
    40.0
    
    #replace NaNs
    data.loc[mask1, 'Age'] = data.loc[mask1, 'Age'].fillna(med)
    print (data)
        Age  Pclass     Sex  a
    0  40.0       1  female  a
    1  20.0       2  female  a
    2  30.0       1    male  a
    3  20.0       2  female  d
    4  40.0       1  female  d
    5   NaN       1    male  d
    

    Which is the same as:

    mask2 = mask1 &(data['Age'].isnull())
    
    data.loc[mask2, 'Age'] = med
    print (data)
        Age  Pclass     Sex  a
    0  40.0       1  female  a
    1  20.0       2  female  a
    2  30.0       1    male  a
    3  20.0       2  female  d
    4  40.0       1  female  d
    5   NaN       1    male  d
    

    EDIT:

    If you need to replace the NaNs in every group with that group's median:

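    #transform keeps the original index, so the result aligns on assignment
    #(in recent pandas versions groupby.apply can prepend the group keys to the index)
    data['Age'] = data.groupby(["Sex","Pclass"])["Age"].transform(lambda x: x.fillna(x.median()))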
    print (data)
    
        Age  Pclass     Sex  a
    0  40.0       1  female  a
    1  20.0       2  female  a
    2  30.0       1    male  a
    3  20.0       2  female  d
    4  40.0       1  female  d
    5  30.0       1    male  d
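A variant of the grouped fill above that avoids the Python lambda, offered as a sketch rather than as part of the original answer: `transform('median')` produces one median per row, and `fillna` accepts that Series and aligns it on the index. The sample frame is the same one used earlier in this answer.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': list('aaaddd'),
                     'Sex': ['female', 'female', 'male', 'female', 'female', 'male'],
                     'Pclass': [1, 2, 1, 2, 1, 1],
                     'Age': [40, 20, 30, 20, np.nan, np.nan]})

# One median per row, taken from that row's (Sex, Pclass) group; NaNs are skipped.
group_median = data.groupby(['Sex', 'Pclass'])['Age'].transform('median')

# fillna with a Series aligns on the index, so each NaN gets its own group's median.
data['Age'] = data['Age'].fillna(group_median)
print(data)
```

A group whose ages are all NaN has a NaN median, so its rows stay NaN with this approach.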
    
  • 1

    If you want to do the same thing for every group, you can use this trick:

    import numpy as np
    import pandas as pd

    data = pd.DataFrame({'a':list('aaaddd'),
                        'Sex':['female','female','male','female','female','male'],
                        'Pclass':[1,2,1,2,1,1],
                        'Age':[40,20,30,20, np.nan, np.nan]})
    # one row per (Sex, Pclass) group holding that group's median age
    df = data.groupby(["Sex","Pclass"])["Age"].median().to_frame().reset_index()
    df.rename(columns={"Age":"Med"}, inplace=True)
    # attach the group median to every row, then use it only where Age is NaN
    data = pd.merge(left=data,right=df, how='left', on=["Sex", "Pclass"])
    data["Age"] = np.where(data["Age"].isnull(), data["Med"], data["Age"])
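The merge above leaves the helper column `Med` in `data`. A sketch of the same steps wrapped in a function that drops the helper afterwards; the function name `fill_by_group_median` and its parameters are hypothetical, and the sketch assumes the frame has no existing column called `Med`.

```python
import numpy as np
import pandas as pd

def fill_by_group_median(frame, keys, col):
    """Hypothetical helper: fill NaNs in `col` with the median of each `keys` group."""
    # Lookup table with one row per group and the group's median in 'Med'.
    med = frame.groupby(keys)[col].median().rename('Med').reset_index()
    # A left merge keeps every row of `frame`, in its original order.
    merged = pd.merge(left=frame, right=med, how='left', on=keys)
    merged[col] = np.where(merged[col].isnull(), merged['Med'], merged[col])
    # Drop the helper column so the result has the same columns as the input.
    return merged.drop(columns='Med')

data = pd.DataFrame({'a': list('aaaddd'),
                     'Sex': ['female', 'female', 'male', 'female', 'female', 'male'],
                     'Pclass': [1, 2, 1, 2, 1, 1],
                     'Age': [40, 20, 30, 20, np.nan, np.nan]})
result = fill_by_group_median(data, ['Sex', 'Pclass'], 'Age')
print(result)
```

The function returns a new frame with a fresh integer index and leaves the input frame untouched.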
    

    UPDATE:

    # dummy dataframe
    n = int(1e7)
    data = pd.DataFrame({"Age":np.random.choice([10,20,20,30,30,40,np.nan], n),
                         "Pclass":np.random.choice([1,2,3], n),
                         "Sex":np.random.choice(["male","female"], n),
                         "a":np.random.choice(["a","b","c","d"], n)})
    

    Running it on my machine (same as before, but without the rename):

    df = data.groupby(["Sex","Pclass"])["Age"].agg(['median']).reset_index()
    data = pd.merge(left=data,right=df, how='left', on=["Sex", "Pclass"])
    data["Age"] = np.where(data["Age"].isnull(), data["median"], data["Age"])
    
    CPU times: user 1.98 s, sys: 216 ms, total: 2.2 s
    Wall time: 2.2 s
    

    While the mask solution takes:

    for sex in ["male", "female"]:
        for pclass in range(1,4):
            mask1 =(data['Sex'] == sex)&(data['Pclass'] == pclass)
            med = data.loc[mask1, 'Age'].median()
            data.loc[mask1, 'Age'] = data.loc[mask1, 'Age'].fillna(med)
    
    CPU times: user 5.13 s, sys: 60 ms, total: 5.19 s
    Wall time: 5.19 s
    

    @jezrael's solution is faster:

    data['Age'] = data.groupby(["Sex","Pclass"])["Age"].apply(lambda x: x.fillna(x.median()))
    
    CPU times: user 1.34 s, sys: 92 ms, total: 1.44 s
    Wall time: 1.44 s
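The timings above depend on the machine and the pandas version, so they are worth re-measuring locally. A rough sketch of such a comparison, assuming a much smaller frame (100,000 rows instead of 1e7), `time.perf_counter` in place of the `%%time` magic, and `transform` in place of `groupby.apply` for the last variant. The function names are hypothetical.

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # much smaller than the 1e7 rows used above, to keep the run short
data = pd.DataFrame({"Age": rng.choice([10, 20, 20, 30, 30, 40, np.nan], n),
                     "Pclass": rng.choice([1, 2, 3], n),
                     "Sex": rng.choice(["male", "female"], n)})
keys = ["Sex", "Pclass"]

def fill_merge(df):
    # Merge a per-group median table back onto the rows.
    med = df.groupby(keys)["Age"].agg(["median"]).reset_index()
    out = pd.merge(left=df, right=med, how="left", on=keys)
    return np.where(out["Age"].isnull(), out["median"], out["Age"])

def fill_mask(df):
    # Loop over every (Sex, Pclass) combination with a boolean mask.
    df = df.copy()
    for sex in ["male", "female"]:
        for pclass in range(1, 4):
            mask = (df["Sex"] == sex) & (df["Pclass"] == pclass)
            med = df.loc[mask, "Age"].median()
            df.loc[mask, "Age"] = df.loc[mask, "Age"].fillna(med)
    return df["Age"].to_numpy()

def fill_transform(df):
    # Group-wise fillna; transform returns values in the original row order.
    filled = df.groupby(keys)["Age"].transform(lambda x: x.fillna(x.median()))
    return filled.to_numpy()

results = {}
for name, func in [("merge", fill_merge), ("mask", fill_mask),
                   ("transform", fill_transform)]:
    start = time.perf_counter()
    results[name] = func(data)
    print(f"{name}: {time.perf_counter() - start:.3f} s")

# All three approaches should produce the same filled column.
print(np.allclose(results["merge"], results["mask"]),
      np.allclose(results["merge"], results["transform"]))
```

A single run per approach is noisy, so repeating each measurement several times gives a fairer comparison.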
    
