
Pandas time series resampling


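I have a list of voyages, each with a start date, an end date, and the earnings for that voyage. I want to calculate monthly income, but I don't know how to do this with pandas: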

'2016-02-28 07:30:00', '2016-04-30 00:00:00', '600000'
'2016-05-18 10:30:00', '2016-07-12 02:19:00', '700000'

The way I do this by hand is to count the voyage's days in each month and multiply by earnings / total voyage length.

1 Answer

  • 2

    You need to count the hours inside each row's date range. To do that, use DataFrame.apply with a custom function that builds a date_range, groups it by month, and aggregates with size.
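
    The df printed below is never constructed in the answer. A minimal sketch of one way to build it from the question's two rows, assuming the three values are start, end, and price (the column names are taken from the printed output):

```python
import pandas as pd

# The two voyages from the question; the mapping of the three fields
# to start / end / price is an assumption based on the printed frame.
df = pd.DataFrame({
    "start": pd.to_datetime(["2016-02-28 07:30:00", "2016-05-18 10:30:00"]),
    "end": pd.to_datetime(["2016-04-30 00:00:00", "2016-07-12 02:19:00"]),
    "price": [600000, 700000],
})
```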

    print (df)
                    start                 end   price
    0 2016-02-28 07:30:00 2016-04-30 00:00:00  600000
    1 2016-05-18 10:30:00 2016-07-12 02:19:00  700000
    
    print (df.dtypes)
    start    datetime64[ns]
    end      datetime64[ns]
    price             int64
    dtype: object
    
    def f(x):
        rng = pd.date_range(x.start, x.end, freq='H')
        return rng.to_series().groupby([rng.month]).size()
    df1 = df.apply(f, axis=1)
    print (df1)
          2      3      4      5      6      7
    0  41.0  744.0  696.0    NaN    NaN    NaN
    1   NaN    NaN    NaN  326.0  720.0  266.0
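
    One caveat with grouping on rng.month: the key is the month number only, so a voyage longer than a year would merge, say, December 2016 with December 2017. A sketch that groups on year-month labels instead. The name hours_per_month and the second voyage are hypothetical, added only to show the effect:

```python
import pandas as pd

df = pd.DataFrame({
    "start": pd.to_datetime(["2016-02-28 07:30:00", "2016-12-31 22:00:00"]),
    "end": pd.to_datetime(["2016-04-30 00:00:00", "2017-12-01 02:00:00"]),
    "price": [600000, 700000],  # second row is made up for illustration
})

def hours_per_month(row):
    # Hourly ticks from start to end, stepping from the start time.
    rng = pd.date_range(row["start"], row["end"], freq=pd.offsets.Hour())
    # "YYYY-MM" labels keep the same month of different years apart.
    labels = rng.strftime("%Y-%m")
    return rng.to_series().groupby(labels).size()

hours = df.apply(hours_per_month, axis=1)
```

    Here hours has one column per year-month label, so the two Decembers of the long voyage are counted separately.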
    

    Then divide price by the total number of hours in each row to get price_per_hour:

    price_per_hour = df.price / df1.sum(axis=1)
    print (price_per_hour)
    0    405.131668
    1    533.536585
    dtype: float64
    

    Finally, multiply each month's hours by price_per_hour with mul:

    print (df1.mul(price_per_hour, axis=0))
                  2              3              4              5              6  \
    0  16610.398379  301417.960837  281971.640783            NaN            NaN   
    1           NaN            NaN            NaN  173932.926829  384146.341463   
    
                   7  
    0            NaN  
    1  141920.731707  
    
    # check: each row sums back to its price
    print (df1.mul(price_per_hour, axis=0).sum(axis=1))
    0    600000.0
    1    700000.0
    dtype: float64
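
    The table above is still one row per voyage. To get the monthly income the question asks for, the month columns can be summed across voyages. A minimal end-to-end sketch, assuming all voyages fall within a single calendar year (the keys are month numbers only); the names hours_per_month, per_voyage, and monthly_income are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "start": pd.to_datetime(["2016-02-28 07:30:00", "2016-05-18 10:30:00"]),
    "end": pd.to_datetime(["2016-04-30 00:00:00", "2016-07-12 02:19:00"]),
    "price": [600000, 700000],
})

def hours_per_month(row):
    # Count the hourly ticks of one voyage that fall in each month.
    rng = pd.date_range(row["start"], row["end"], freq=pd.offsets.Hour())
    return rng.to_series().groupby(rng.month).size()

hours = df.apply(hours_per_month, axis=1)
price_per_hour = df["price"] / hours.sum(axis=1)
per_voyage = hours.mul(price_per_hour, axis=0)

# Column-wise sum; NaN (months a voyage does not touch) is skipped.
monthly_income = per_voyage.sum()
```

    With the sample data no two voyages share a month, so each monthly total equals the single per-voyage value in that column.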
    

    You can also compute the prices from days instead: change freq='H' to freq='D'. I think that is less accurate, though:

    def f(x):
        rng = pd.date_range(x.start, x.end, freq='D')
        return rng.to_series().groupby([rng.month]).size()
    
    df1 = df.apply(f, axis=1)
    print (df1)
         2     3     4     5     6     7
    0  2.0  31.0  29.0   NaN   NaN   NaN
    1  NaN   NaN   NaN  14.0  30.0  11.0
    
    # with freq='D' this is a price per day, despite the variable name
    price_per_hour = df.price / df1.sum(axis=1)
    print (price_per_hour)
    0     9677.419355
    1    12727.272727
    dtype: float64
    
    print (df1.mul(price_per_hour, axis=0))
                 2         3             4              5              6         7
    0  19354.83871  300000.0  280645.16129            NaN            NaN       NaN
    1          NaN       NaN           NaN  178181.818182  381818.181818  140000.0
    
    print (df1.mul(price_per_hour, axis=0).sum(axis=1))
    0    600000.0
    1    700000.0
    dtype: float64
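
    Both the hourly and the daily counts step from the start time, so a trailing partial hour or day is dropped. If that matters, one alternative (not part of the answer above) is to prorate by exact elapsed time inside each calendar month. A sketch under the assumption that end is strictly later than start; prorate_by_month is a hypothetical helper:

```python
import pandas as pd

def prorate_by_month(start, end, price):
    """Split price across calendar months in proportion to elapsed time."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    total = end - start  # assumed to be positive
    shares = {}
    for period in pd.period_range(start, end, freq="M"):
        # Clip the month to the part the voyage actually covers.
        lo = max(start, period.start_time)
        hi = min(end, (period + 1).start_time)
        shares[str(period)] = price * ((hi - lo) / total)
    return pd.Series(shares)

shares = prorate_by_month("2016-02-28 07:30:00", "2016-04-30 00:00:00", 600000)
```

    The result is indexed by "YYYY-MM" labels and sums back to the voyage price.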
    

    Another solution reshapes with melt, then uses groupby and resample. It also needs a groupby by month and an aggregate with size:

    df['count'] = df.index
    df1 = pd.melt(df, id_vars=['price', 'count'], value_name='dates')
    print (df1)
        price  count variable               dates
    0  600000      0    start 2016-02-28 07:30:00
    1  700000      1    start 2016-05-18 10:30:00
    2  600000      0      end 2016-04-30 00:00:00
    3  700000      1      end 2016-07-12 02:19:00
    
    df2 = df1.set_index('dates').groupby('count').resample('D').size()
    print (df2)
    count  dates     
    0      2016-02-28    1
           2016-02-29    0
           2016-03-01    0
           2016-03-02    0
           2016-03-03    0
           2016-03-04    0
           2016-03-05    0
           2016-03-06    0
           2016-03-07    0
           2016-03-08    0
           2016-03-09    0
           2016-03-10    0
           2016-03-11    0
           2016-03-12    0
    ...
    ...
    
    print (df2.index.get_level_values('dates').month)
    [2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4
     4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5
     5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7
     7 7 7 7 7 7 7 7]
    
    df3 = df2.groupby([df2.index.get_level_values('count'), 
                       df2.index.get_level_values('dates').month]).size().unstack()
    print (df3)
             2     3     4     5     6     7
    count                                   
    0      2.0  31.0  30.0   NaN   NaN   NaN
    1      NaN   NaN   NaN  14.0  30.0  12.0
    
    # days were counted here, so this is again a price per day
    price_per_hour = df.price / df3.sum(axis=1)
    print (price_per_hour)
    0     9523.809524
    1    12500.000000
    dtype: float64
    
    print (df3.mul(price_per_hour, axis=0))
                      2              3              4         5         6  \
    count                                                                   
    0      19047.619048  295238.095238  285714.285714       NaN       NaN   
    1               NaN            NaN            NaN  175000.0  375000.0   
                  7  
    count            
    0           NaN  
    1      150000.0  
    
    print (df3.mul(price_per_hour, axis=0).sum(axis=1))
    count
    0    600000.0
    1    700000.0
    dtype: float64
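
    The day counts here differ from the date_range version by one day in each voyage's last month (30 vs 29 for April, 12 vs 11 for July). Judging by the output, that is because resample('D') bins whole calendar days, while date_range steps in 24-hour increments from the start time. A small sketch that reproduces the two counts for the first voyage; the variable names are illustrative:

```python
import pandas as pd

start = pd.Timestamp("2016-02-28 07:30:00")
end = pd.Timestamp("2016-04-30 00:00:00")

# Steps of 24 hours from 07:30; the last tick is 2016-04-29 07:30.
stepped = pd.date_range(start, end, freq=pd.offsets.Day())

# Whole calendar days touched by the voyage, end day included.
calendar = pd.date_range(start.normalize(), end.normalize(),
                         freq=pd.offsets.Day())

n_stepped, n_calendar = len(stepped), len(calendar)  # 62 and 63
```

    Neither count is exact for a voyage that ends at midnight, so the choice mostly decides whether the final partial day is charged in full or not at all.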
    
