Python熊猫重新取样-Java 学习之路

我有以下数据帧：

Timestamp    S_time1   S_time2   End_Time_1   End_time_2   Sign_1   Sign_2
0    2413044       0        0           0            0          x        x
1    2422476       0        0           0            0          x        x
2    2431908       0        0           0            0          x        x
3    2441341       0        0           0            0          x        x
4    2541232   2526631   2528631     2520631      2530631      10       80
5    2560273   2544946   2546496     2546496      2548496      40       80
6    2577224   2564010   2566010     2566010      2568010     null    null
7    2592905   2580959   2582959     2582959      2584959     null    null

table 就这样继续下去 . 第一列是时间戳，以毫秒为单位 . S_time1 和 End_time_1 是特定符号（数字）出现的持续时间 . 例如，如果我们取第5行， S_time1 是2526631， End_time_1 是2520631，相应的 sign_1 是10，这意味着从2526631到2520631将显示符号10 . 同样的事情发生在 S_time2 和 End_time_2 . sign_2 中的相应值将显示在从 S_time2 到 End_time_2 的持续时间中 .

我想在100毫秒的bin时间内重新采样索引列（ Timestamp ）并检查符号所属的bin时间 . 例如，在每个开始时间和结束时间之间存在2000毫秒的差异 . 因此，相应的符号编号将在大约20个连续的bin时间内重复出现，因为每个bin时间是100毫秒 . 所以我只需要两列：一列是bin时间，第二列是符号 . 看起来像下表:(我只是为了弥补bin时间）

Bin_time   signs
...100        0
...200        0
...300        10
...400        10
...500        10
...600        10

符号10将持续相应的S_time1到End_time_1的持续时间 . 然后，下一个符号80继续S_time2到End_time_2的持续时间 . 我不确定这是否可以在熊猫中完成 . 但我真的需要大熊猫或其他方法的帮助 .

感谢您的帮助和建议 .

1 回答

输入：

print df
  Timestamp  S_time1  S_time2  End_Time_1  End_time_2 Sign_1 Sign_2
0    2413044        0        0           0           0      x      x
1    2422476        0        0           0           0      x      x
2    2431908        0        0           0           0      x      x
3    2441341        0        0           0           0      x      x
4    2541232  2526631  2528631     2520631     2530631     10     80
5    2560273  2544946  2546496     2546496     2548496     40     80
6    2577224  2564010  2566010     2566010     2568010   null   null
7    2592905  2580959  2582959     2582959     2584959   null   null

2种方法：

In [231]: %timeit s(df)
1 loops, best of 3: 2.78 s per loop

In [232]: %timeit m(df)
1 loops, best of 3: 690 ms per loop

def m(df):
    #resample column Timestamp by 100ms, convert bak to integers 
    df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
    df['i'] = 1
    df = df.set_index('Timestamp')
    df1 = df[[]].resample('100ms', how='first').reset_index()
    df1['Timestamp'] = (df1['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    #felper column i for merging
    df1['i'] = 1
    #print df1

    out = df1.merge(df,on='i', how='left')
    out1 = out[['Timestamp', 'Sign_1']][(out.Timestamp >= out.S_time1) & (out.Timestamp <= out.End_Time_1)]
    out2 = out[['Timestamp', 'Sign_2']][(out.Timestamp >= out.S_time2) & (out.Timestamp <= out.End_time_2)]

    out1 = out1.rename(columns={'Sign_1':'Bin_time'})
    out2 = out2.rename(columns={'Sign_2':'Bin_time'})

    df = pd.concat([out1, out2], ignore_index=True).drop_duplicates(subset='Timestamp')
    df1 = df1.set_index('Timestamp')
    df = df.set_index('Timestamp')
    df = df.reindex(df1.index).reset_index()
    #print df.head(10)

def s(df):
    #resample column Timestamp by 100ms, convert bak to integers 
    df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
    df = df.set_index('Timestamp')
    out = df[[]].resample('100ms', how='first')
    out = out.reset_index()
    out['Timestamp'] = (out['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    #print out.head(10)

    #search start end 
    def search(x):
        mask1 = (df.S_time1<=x['Timestamp']) & (df.End_Time_1>=x['Timestamp'])
        #if at least one True return first value of series
        if mask1.any():
                return df.loc[mask1].Sign_1[0]
        #check second start and end time
        else:
                mask2 = (df.S_time2<=x['Timestamp']) & (df.End_time_2>=x['Timestamp'])
                if mask2.any():
                    #if at least one True return first value
                    return df.loc[mask2].Sign_2[0]
                else:
                    #if all False return NaN
                    return np.nan

    out['Bin_time'] = out.apply(search, axis=1)
    #print out.head(10)

回复于 2024-05-03T23:40:40+08:00

Python熊猫重新取样

1 回答

相关问题