Python Pandas: Detecting the frequency of a time series

Suppose I have loaded time series data from SQL or CSV (not created it in Python), and the index looks like this:

DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
               '2015-03-02 02:00:00', '2015-03-02 03:00:00',
               '2015-03-02 04:00:00', '2015-03-02 05:00:00',
               '2015-03-02 06:00:00', '2015-03-02 07:00:00',
               '2015-03-02 08:00:00', '2015-03-02 09:00:00', 
               ...
               '2015-07-19 14:00:00', '2015-07-19 15:00:00',
               '2015-07-19 16:00:00', '2015-07-19 17:00:00',
               '2015-07-19 18:00:00', '2015-07-19 19:00:00',
               '2015-07-19 20:00:00', '2015-07-19 21:00:00',
               '2015-07-19 22:00:00', '2015-07-19 23:00:00'],
              dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)

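As you can see, 'freq' is None. I would like to know how to detect the frequency of this series and set 'freq' to that frequency.

If possible, I would like this to work when the data is not continuous (there are many gaps in the series).

I tried taking the mode of all the differences between consecutive timestamps, but I don't know how to convert it into a format the series can read.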

3 Answers

  • 3

    Perhaps try taking the differences of the time index and using the mode (or the smallest difference) as the frequency.

    import pandas as pd
    import numpy as np
    
    # simulate some data
    # ===================================
    np.random.seed(0)
    dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
    dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
    df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
    df
    
                            col
    2015-03-02 01:00:00  2.0261
    2015-03-02 04:00:00  1.3325
    2015-03-02 05:00:00 -0.9867
    2015-03-02 06:00:00 -0.0671
    2015-03-02 08:00:00 -1.1131
    2015-03-02 09:00:00  0.0494
    2015-03-02 10:00:00 -0.8130
    2015-03-02 11:00:00  1.8453
    ...                     ...
    2015-07-19 13:00:00 -0.4228
    2015-07-19 14:00:00  1.1962
    2015-07-19 15:00:00  1.1430
    2015-07-19 16:00:00 -1.0080
    2015-07-19 18:00:00  0.4009
    2015-07-19 19:00:00 -1.8434
    2015-07-19 20:00:00  0.5049
    2015-07-19 23:00:00 -0.5349
    
    [2000 rows x 1 columns]
    
    # processing
    # ==================================
    # the gap distribution: how often each gap between consecutive timestamps occurs
    res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()
    res
    
    01:00:00    1181
    02:00:00     499
    03:00:00     180
    04:00:00      93
    05:00:00      24
    06:00:00      10
    07:00:00       9
    08:00:00       3
    dtype: int64
    
    # value_counts sorts by count, so the first entry is the mode,
    # which can be taken as the frequency
    res.index[0]  # output: Timedelta('0 days 01:00:00')
    # or maybe the smallest difference
    res.index.min()  # output: Timedelta('0 days 01:00:00')
    
    # get full datetime rng
    full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
    full_rng
    
    DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
                   '2015-03-02 03:00:00', '2015-03-02 04:00:00',
                   '2015-03-02 05:00:00', '2015-03-02 06:00:00',
                   '2015-03-02 07:00:00', '2015-03-02 08:00:00',
                   '2015-03-02 09:00:00', '2015-03-02 10:00:00', 
                   ...
                   '2015-07-19 14:00:00', '2015-07-19 15:00:00',
                   '2015-07-19 16:00:00', '2015-07-19 17:00:00',
                   '2015-07-19 18:00:00', '2015-07-19 19:00:00',
                   '2015-07-19 20:00:00', '2015-07-19 21:00:00',
                   '2015-07-19 22:00:00', '2015-07-19 23:00:00'],
                  dtype='datetime64[ns]', length=3359, freq='H', tz=None)
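
To end up with an index that actually has 'freq' set, which is what the question asks for, one option is to reindex the data onto the full range. The following is only a sketch under stated assumptions, not part of the answer above: the helper name regularize and the tiny sample frame are made up for illustration, and it assumes the most common gap is the true step of the series.

```python
import numpy as np
import pandas as pd


def regularize(df):
    """Reindex df onto a regular grid whose step is the most common gap.

    Hypothetical helper for illustration; assumes a sorted DatetimeIndex.
    """
    gaps = df.index.to_series().diff().dropna()   # gaps between neighbours
    step = gaps.mode().iloc[0]                    # most frequent gap (Timedelta)
    full_rng = pd.date_range(df.index[0], df.index[-1], freq=step)
    return df.reindex(full_rng)                   # missing timestamps become NaN rows


# small hourly sample with a hole between 02:00 and 05:00
idx = pd.DatetimeIndex(['2015-03-02 00:00', '2015-03-02 01:00',
                        '2015-03-02 02:00', '2015-03-02 05:00',
                        '2015-03-02 06:00'])
sample = pd.DataFrame({'col': np.arange(5.0)}, index=idx)

regular = regularize(sample)
print(regular)
print(regular.index.freq)
```

The rows that were missing show up as NaN, so they can be filled or dropped afterwards depending on what the analysis needs.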
    
  • 3

    Find the minimum time difference:

    np.diff(df.index.values).min()
    

    The result is usually in nanoseconds. To get the frequency (in Hz), assuming ns:

    freq = 1e9 / np.diff(df.index.values).min().astype('int64')  # int64 avoids overflow where int is 32-bit
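
The question also asks how to turn the detected gap into something pandas can read as a frequency. A minimal sketch, assuming the smallest gap is the true step: wrap the NumPy timedelta in pd.Timedelta and pass it to to_offset from pandas.tseries.frequencies. The sample index and the variable names here are invented for illustration.

```python
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

# small hourly sample with a hole between 02:00 and 05:00
idx = pd.DatetimeIndex(['2015-03-02 00:00', '2015-03-02 01:00',
                        '2015-03-02 02:00', '2015-03-02 05:00',
                        '2015-03-02 06:00'])

# smallest gap as a numpy timedelta64, converted to a pandas Timedelta
min_gap = pd.Timedelta(np.diff(idx.values).min())

# a DateOffset object that pandas accepts wherever a freq is expected
offset = to_offset(min_gap)

# the same information expressed in Hz, without relying on the ns unit
hz = 1.0 / min_gap.total_seconds()

print(min_gap, offset, hz)
```

The resulting offset can be passed as the freq argument of pd.date_range or to asfreq, which avoids hand-converting nanoseconds into an alias string.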
    
  • 5

    It is worth mentioning that if the data is continuous, you can use the pandas.DatetimeIndex.inferred_freq attribute:

    dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
    dt_ix = pd.DatetimeIndex(dt_ix.values)  # discard the stored freq, as if loaded from SQL/CSV
    dt_ix.inferred_freq
    Out[2]: 'H'
    

    Or the pandas.infer_freq function:

    pd.infer_freq(dt_ix)
    Out[3]: 'H'
    

    If the data is not continuous, pandas.infer_freq returns None. Similar to the approaches already suggested, another option is the pandas.Series.diff method:

    split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
    split_ix.to_series().diff().min()
    Out[4]: Timedelta('0 days 01:00:00')
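
The two cases can be combined: try pandas.infer_freq first and fall back to the smallest gap when it returns None. This is a sketch rather than an established recipe; detect_freq is a hypothetical helper name, the sample data is synthetic, and the fallback assumes the smallest gap is the real step.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def detect_freq(index):
    """Return a DateOffset for index (hypothetical helper).

    Uses pd.infer_freq when the index is regular, otherwise the
    smallest gap between consecutive timestamps.
    """
    inferred = pd.infer_freq(index)            # None when the index has gaps
    if inferred is not None:
        return to_offset(inferred)
    return to_offset(index.to_series().diff().min())


# 48 hourly timestamps with ten consecutive hours removed
full = pd.date_range('2015-03-02', periods=48, freq=pd.Timedelta(hours=1))
gappy = full[:10].append(full[20:])
series = pd.Series(range(len(gappy)), index=gappy)

offset = detect_freq(series.index)
regular = series.asfreq(offset)                # gaps become NaN, index gets a freq

print(offset)
print(regular.index.freq)
```

asfreq rebuilds the index between the first and last timestamp at the detected step, so the result has 'freq' set and the removed hours appear as NaN.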
    
