I have data in the following general format that I would like to resample into 30-day time series windows:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43
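I would like the 30-day windows to start at 2014-01-01 and end at 2018-12-31. A customer is not guaranteed to have a record in every window. If a customer has multiple transactions in a window, the result should take the weighted average of the price (weighted by units), sum the units, and concatenate the product names, so that there is one record per customer per window.
What I have so far is something like this: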
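import numpy as np
import pandas as pd

# df is assumed to hold the data above, with transaction_dt parsed as datetime64
# Weighted average of price, using the units of the same rows as weights
wa = lambda x: np.average(x, weights=df.loc[x.index, 'units'])
con = lambda x: '/'.join(x)
# Pass the function objects themselves, not their names as strings;
# the grouping keys (customer_id, transaction_dt) are left out of the dict
agg_funcs = {'product': con,
'price': wa,
'units': 'sum'}
df_window = df.groupby(['customer_id', pd.Grouper(key='transaction_dt', freq='30D')]).agg(agg_funcs)
df_window_final = df_window.unstack('customer_id', fill_value=0)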
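If anyone knows a better way to approach this (especially with in-place and/or vectorized methods), I would appreciate it. Ideally, I would also like to add the window start and end dates as columns on each row.
Ideally, the final output would look like this: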
'customer_id','transaction_dt','product','price','units','window_start_dt','window_end_dt'
1,2004-01-02,thing1/thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
2,2004-01-29,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
3,2017-07-15,thing3,(weighted average price),(total units),(window_start_dt),(window_end_dt)
3,2016-05-12,thing3,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2012-02-23,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2009-10-10,thing1,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2014-04-04,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
5,2008-07-09,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
1 Answer
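Edit: new solution. I think you can convert each transaction_dt to a 30-day Period object and then group by it. We can then use that DataFrame to get the concatenated products, the weighted average price, and the sum of units. Finally, we use some Period functionality to get the end time.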
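For illustration, here is a minimal sketch of the windowing and aggregation steps. It does not use Period objects; instead it assigns each transaction to a fixed 30-day bin with integer day arithmetic from an origin date, which is a swapped-in technique rather than the code this answer originally showed. The function name `resample_30d`, the helper column `revenue`, and the default origin of 2004-01-01 (chosen so the sample rows fall inside a window; the question itself states 2014-01-01) are all assumptions made for the example.

```python
import pandas as pd

# A few rows from the question's sample data
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 4],
    "transaction_dt": pd.to_datetime(
        ["2004-01-02", "2004-01-17", "2004-01-29", "2014-04-04"]),
    "product": ["thing1", "thing2", "thing2", "thing2"],
    "price": [25, 150, 150, 150],
    "units": [47, 8, 25, 2],
})


def resample_30d(df, origin="2004-01-01", days=30):
    """Hypothetical helper: one row per customer per fixed `days`-day window."""
    origin = pd.Timestamp(origin)  # assumed window origin, not from the source
    out = df.sort_values(["customer_id", "transaction_dt"]).copy()

    # Index of the window each transaction falls in, counted from the origin
    idx = (out["transaction_dt"] - origin).dt.days // days
    out["window_start_dt"] = origin + pd.to_timedelta(idx * days, unit="D")
    out["window_end_dt"] = out["window_start_dt"] + pd.Timedelta(days=days - 1)

    # Weighted average price = sum(price * units) / sum(units), so a plain
    # sum over a helper column is enough and no per-group lambda is needed
    out["revenue"] = out["price"] * out["units"]

    result = (
        out.groupby(["customer_id", "window_start_dt", "window_end_dt"],
                    as_index=False)
           .agg(transaction_dt=("transaction_dt", "first"),
                product=("product", lambda s: "/".join(s)),
                revenue=("revenue", "sum"),
                units=("units", "sum"))
    )
    result["price"] = result["revenue"] / result["units"]
    return result[["customer_id", "transaction_dt", "product", "price",
                   "units", "window_start_dt", "window_end_dt"]]


result = resample_30d(df)
print(result)
```

Computing the weighted average as a ratio of two sums keeps the whole aggregation vectorized except for the string join. Windows in which a customer has no transactions simply produce no row; transactions outside the desired date range would need to be filtered out beforehand.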