Resampling start and end date columns

I have a data frame that looks like this:

START_TIME   END_TIME     TRIAL_No        itemnr
 2403950      2413067      Trial: 1        P14
 2413378      2422499      Trial: 2        P03
 2422814      2431931      Trial: 3        P13
 2432246      2441363      Trial: 4        P02
 2523540      2541257      Trial: 5        P11
 2541864      2560297      Trial: 6        P10
 2560916      2577249      Trial: 7        P05

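The table goes on like this. START_TIME and END_TIME are both in milliseconds and give the start and end time of each trial. What I want to do is resample START_TIME into 100 ms bin times and fill in the variables (TRIAL_No and itemnr) between each START_TIME and END_TIME. Outside these ranges the variables should have the value "NA". For example, in the first row START_TIME is 2403950 and END_TIME is 2413067, a difference of 9117 ms. So "Trial: 1" lasts 9117 ms and, because the bin times are 100 ms apart, it needs 91 bin times. So I want "Trial: 1" and "P14" repeated 91 times in the resulting data frame, and likewise for the remaining rows. It should look like this: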

Bin_time     TRIAL_No    itemnr
2403950      Trial: 1    P14
2404050      Trial: 1    P14
2404150      Trial: 1    P14
            ...
2413050      Trial: 1    P14
2413150      Trial: 2    P03
2413250      Trial: 2    P03

and so on. I'm not sure whether this can be done directly in pandas or whether some preprocessing is needed.
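
The bin arithmetic for the first row can be checked with a few lines. This is only a sketch: the variable names are illustrative, and it assumes a bin time is kept whenever it is not later than END_TIME. Under that assumption, 91 steps of 100 ms give 92 bin times once the starting time itself is counted.

```python
import numpy as np

# Values taken from the first row of the table above.
start, end, step = 2403950, 2413067, 100

duration = end - start       # length of the trial in ms
n_steps = duration // step   # whole 100 ms steps that fit into the trial

# Bin times from START_TIME up to (and including, if hit exactly) END_TIME.
bin_times = np.arange(start, end + 1, step)

print(duration, n_steps, len(bin_times), bin_times[-1])
```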

1 Answer

  • 1

    After building a new data frame with concat, you can group it by row and apply resample to each group (forward-filling with ffill).

    import numpy as np
    import pandas as pd
    
    print(df)
    #   START_TIME  END_TIME  TRIAL_No itemnr
    #0     2403950   2413067  Trial: 1    P14
    #1     2413378   2422499  Trial: 2    P03
    #2     2422814   2431931  Trial: 3    P13
    #3     2432246   2441363  Trial: 4    P02
    #4     2523540   2541257  Trial: 5    P11
    #5     2541864   2560297  Trial: 6    P10
    #6     2560916   2577249  Trial: 7    P05
    
    #PREPROCESSING
    #helper column for matching start and end rows
    df['row'] = range(len(df))
    
    #reshape: stack the START_TIME rows and the END_TIME rows, so each trial appears twice (once per timestamp)
    starts = df[['START_TIME','TRIAL_No','itemnr','row']].rename(columns={'START_TIME':'Bin_time'})
    ends = df[['END_TIME','TRIAL_No','itemnr','row']].rename(columns={'END_TIME':'Bin_time'})
    df = pd.concat([starts, ends])
    df = df.set_index('row', drop=True)
    df = df.sort_index()
    
    #convert milliseconds to timedelta so the data can be resampled in 100 ms steps
    df['Bin_time'] = df['Bin_time'].astype('timedelta64[ms]')
    
    print(df)
    #           Bin_time  TRIAL_No itemnr
    #row                                 
    #0   00:40:03.950000  Trial: 1    P14
    #0   00:40:13.067000  Trial: 1    P14
    #1   00:40:13.378000  Trial: 2    P03
    #1   00:40:22.499000  Trial: 2    P03
    #2   00:40:22.814000  Trial: 3    P13
    #2   00:40:31.931000  Trial: 3    P13
    #3   00:40:32.246000  Trial: 4    P02
    #3   00:40:41.363000  Trial: 4    P02
    #4   00:42:03.540000  Trial: 5    P11
    #4   00:42:21.257000  Trial: 5    P11
    #5   00:42:21.864000  Trial: 6    P10
    #5   00:42:40.297000  Trial: 6    P10
    #6   00:42:40.916000  Trial: 7    P05
    #6   00:42:57.249000  Trial: 7    P05
    
    print(df.dtypes)
    #Bin_time    timedelta64[ms]
    #TRIAL_No             object
    #itemnr               object
    #dtype: object
    
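    #resample each trial to 100 ms bins and forward-fill the missing rows
    #(the how= and fill_method= arguments were removed from resample in later pandas versions)
    df = df.groupby(df.index).apply(lambda x: x.set_index('Bin_time').resample('100ms').first().ffill())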
    
    df = df.reset_index()
    df = df.drop(['row'], axis=1)
    
    #convert timedelta back to integer milliseconds
    df['Bin_time'] = (df['Bin_time'] / np.timedelta64(1, 'ms')).astype(int)
    
    print(df.head())
    #  Bin_time  TRIAL_No itemnr
    #0  2403950  Trial: 1    P14
    #1  2404050  Trial: 1    P14
    #2  2404150  Trial: 1    P14
    #3  2404250  Trial: 1    P14
    #4  2404350  Trial: 1    P14
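
    The same per-trial expansion can also be sketched without resample, by generating the bin times of each trial directly. This is an alternative illustration, not the method above: the helper name expand_trial and the three-row sample frame are assumptions made for the example, and a bin time is kept only if it is not later than END_TIME.

```python
import numpy as np
import pandas as pd

# Small sample in the same layout as the question's data frame.
trials = pd.DataFrame({
    "START_TIME": [2403950, 2413378, 2422814],
    "END_TIME": [2413067, 2422499, 2431931],
    "TRIAL_No": ["Trial: 1", "Trial: 2", "Trial: 3"],
    "itemnr": ["P14", "P03", "P13"],
})


def expand_trial(row, step_ms=100):
    """Hypothetical helper: one output row per bin time of a single trial."""
    bin_times = np.arange(row.START_TIME, row.END_TIME + 1, step_ms)
    # The scalar labels are broadcast along the array of bin times.
    return pd.DataFrame({
        "Bin_time": bin_times,
        "TRIAL_No": row.TRIAL_No,
        "itemnr": row.itemnr,
    })


binned = pd.concat(
    [expand_trial(row) for row in trials.itertuples(index=False)],
    ignore_index=True,
)
print(binned.head())
```

    As in the resample version, each trial starts its own 100 ms grid at its START_TIME, so the bin times of different trials are not aligned to one common grid.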
    

    Edit:

    If you want NaN outside the groups, you can change the code after the groupby:

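    #resample each trial to 100 ms bins and forward-fill the missing rows
    #(the how= and fill_method= arguments were removed from resample in later pandas versions)
    df = df.groupby(df.index).apply(lambda x: x.set_index('Bin_time').resample('100ms').first().ffill())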
    
    #drop only the first index level (row), keeping Bin_time as the index
    df = df.reset_index(level=0, drop=True)
    #resample the whole frame by 100 ms; bins outside the trials become NaN
    df = df.resample('100ms').first()
    df = df.reset_index()
    #convert timedelta back to integer milliseconds
    df['Bin_time'] = (df['Bin_time'] / np.timedelta64(1, 'ms')).astype(int)
    
    print(df)
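
    A different way to get NaN between trials is to build one common 100 ms grid and look up which trial, if any, contains each bin time. The sketch below uses pd.IntervalIndex for the lookup. It is an assumption-laden illustration rather than the code above: it requires that the trials do not overlap, it uses a two-row sample, and it labels a bin only when its time lies within [START_TIME, END_TIME], so bins just before a trial's start stay NaN.

```python
import numpy as np
import pandas as pd

# Two trials with a gap between 2413067 and 2413378.
trials = pd.DataFrame({
    "START_TIME": [2403950, 2413378],
    "END_TIME": [2413067, 2422499],
    "TRIAL_No": ["Trial: 1", "Trial: 2"],
    "itemnr": ["P14", "P03"],
})

# One common grid of bin times covering all trials.
grid = np.arange(trials["START_TIME"].min(), trials["END_TIME"].max() + 1, 100)

# Closed intervals [START_TIME, END_TIME]; they must not overlap.
intervals = pd.IntervalIndex.from_arrays(
    trials["START_TIME"], trials["END_TIME"], closed="both"
)

# Position of the trial containing each bin time, or -1 if there is none.
pos = intervals.get_indexer(grid)

# Reindexing by position: the label -1 is not in the index, so those rows are NaN.
binned = trials[["TRIAL_No", "itemnr"]].reindex(pos).reset_index(drop=True)
binned.insert(0, "Bin_time", grid)

print(binned[binned["TRIAL_No"].isna()])
```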
    
