I have a large DataFrame (400,000 rows) that looks like this:
import numpy as np
import pandas as pd

# Use a plain list of rows: np.array would coerce every value (NaN included) to a string.
data = [
    [1949, '01/01/2018', np.nan, 17, '30/11/2017'],
    [1949, '01/01/2018', np.nan, 19, np.nan],
    [1811, '01/01/2018', 16, np.nan, '31/11/2017'],
    [1949, '01/01/2018', 15, 21, '01/12/2017'],
    [1949, '01/01/2018', np.nan, 20, np.nan],
    [3212, '01/01/2018', 21, 17, '31/11/2017']
]
columns = ['id', 'ReceivedDate', 'PropertyType', 'MeterType', 'VisitDate']
df = pd.DataFrame(data, columns=columns)
Resulting df:
     id ReceivedDate PropertyType MeterType   VisitDate
0  1949   01/01/2018          NaN        17  30/11/2017
1  1949   01/01/2018          NaN        19         NaN
2  1811   01/01/2018           16       NaN  31/11/2017
3  1949   01/01/2018           15        21  01/12/2017
4  1949   01/01/2018          NaN        20         NaN
5  3212   01/01/2018           21        17  31/11/2017
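I would like to forward fill within a groupby (on id and ReceivedDate), but only where the rows of a group are consecutive in the index (i.e. only forward fill index positions 1 and 4).
I was thinking of adding a column that flags whether a row should be filled according to this criterion, but how do I look at the row above?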
(I was planning to use the solution from this answer: pandas fill forward performance issue
df.isnull().astype(int).groupby(level=0).cumsum().applymap(lambda x: None if x == 0 else 1)
because x = df.groupby(['id', 'ReceivedDate']).ffill()
is very slow.)
Desired df:
     id ReceivedDate PropertyType MeterType   VisitDate
0  1949   01/01/2018          NaN        17  30/11/2017
1  1949   01/01/2018          NaN        19  30/11/2017
2  1811   01/01/2018           16       NaN  31/11/2017
3  1949   01/01/2018           15        21  01/12/2017
4  1949   01/01/2018           15        20  01/12/2017
5  3212   01/01/2018           21        17  31/11/2017
2 Answers
Option 1
groupby + ffill with limit=1
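The code for this option did not survive in the thread, so the following is only a minimal sketch of what groupby + ffill with limit=1 might look like on the sample data. The names keys, value_cols and filled are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question, built from a list so NaN stays a real missing value.
df = pd.DataFrame(
    [
        [1949, "01/01/2018", np.nan, 17, "30/11/2017"],
        [1949, "01/01/2018", np.nan, 19, np.nan],
        [1811, "01/01/2018", 16, np.nan, "31/11/2017"],
        [1949, "01/01/2018", 15, 21, "01/12/2017"],
        [1949, "01/01/2018", np.nan, 20, np.nan],
        [3212, "01/01/2018", 21, 17, "31/11/2017"],
    ],
    columns=["id", "ReceivedDate", "PropertyType", "MeterType", "VisitDate"],
)

keys = ["id", "ReceivedDate"]                           # grouping columns (assumed name)
value_cols = ["PropertyType", "MeterType", "VisitDate"]  # columns to fill (assumed name)

# Forward fill inside each group, carrying a value at most one row forward.
filled = df.copy()
filled[value_cols] = df.groupby(keys)[value_cols].ffill(limit=1)

print(filled)
```

Note that limit=1 counts steps within a group, not positions in the original index. It reproduces the desired output on this sample, but it would also fill a row whose previous group member is several index positions away.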
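Option 2
Instead of groupby + ffill, try using groupby, mask, and shift to fill the NaNs.

Alternatively, keep looping until there are no more matches (i.e. until all columns have been forward filled).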
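The code for this option is also missing from the thread. As a hedged sketch of the mask and shift idea, the variant below compares the key columns against a plain shift() instead of calling groupby(...).shift(), so a value is copied only from the row immediately above in the index. That is a substitution made here to match the "consecutive in the index" requirement, not necessarily what the answer used. The names keys, value_cols, prev_row, same_group, fill_here and filled are assumptions.

```python
import numpy as np
import pandas as pd

# Sample data from the question, built from a list so NaN stays a real missing value.
df = pd.DataFrame(
    [
        [1949, "01/01/2018", np.nan, 17, "30/11/2017"],
        [1949, "01/01/2018", np.nan, 19, np.nan],
        [1811, "01/01/2018", 16, np.nan, "31/11/2017"],
        [1949, "01/01/2018", 15, 21, "01/12/2017"],
        [1949, "01/01/2018", np.nan, 20, np.nan],
        [3212, "01/01/2018", 21, 17, "31/11/2017"],
    ],
    columns=["id", "ReceivedDate", "PropertyType", "MeterType", "VisitDate"],
)

keys = ["id", "ReceivedDate"]
value_cols = ["PropertyType", "MeterType", "VisitDate"]

# Values from the row directly above in the index (row 0 gets NaN).
prev_row = df[value_cols].shift()

# True where the row directly above has the same id and ReceivedDate.
same_group = (df[keys] == df[keys].shift()).all(axis=1)

# Fill only cells that are NaN and whose row directly follows a row of the same group.
fill_here = df[value_cols].isna().to_numpy() & same_group.to_numpy()[:, None]

filled = df.copy()
filled[value_cols] = df[value_cols].mask(fill_here, prev_row)

print(filled)
```

A single pass carries each value forward by one row only. Repeating the pass on its own output until nothing changes would propagate values through longer runs of consecutive rows, which is one way to read the looping suggestion above.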