
Pandas merge on timestamps that do not match exactly


What approaches are there for merging on columns whose timestamps do not match exactly?

DF1:

date        start_time           employee_id  session_id
01/01/2016  01/01/2016 06:03:13  7261824      871631182

DF2:

date        start_time           employee_id  session_id
01/01/2016  01/01/2016 06:03:37  7261824      871631182

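I can join on ['date', 'employee_id', 'session_id'], but sometimes the same employee has several identical sessions on the same date, which produces duplicates. I could drop the rows where this happens, but then I would lose valid sessions.

Is there an efficient way to join when the DF1 timestamp is less than 5 minutes from the DF2 timestamp and session_id and employee_id also match? If a matching record exists, the DF2 timestamp will always be slightly later than the DF1 timestamp, because the event is triggered at some later point in time.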

['employee_id', 'session_id', 'timestamp<5minutes']

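Edit - I assumed someone would have run into this issue before.

I was thinking of doing this:

  • Take my timestamp on each dataframe

  • Create a column that is the timestamp + 5 minutes (rounded)

  • Create a column that is the timestamp - 5 minutes (rounded)

  • Create a 10-minute interval string to join the files on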

df1['low_time'] = df1['start_time'] - timedelta(minutes=5)
df1['high_time'] = df1['start_time'] + timedelta(minutes=5)
df1['interval_string'] = df1['low_time'].astype(str) + df1['high_time'].astype(str)

Does anyone know how to round these 5-minute intervals to the nearest 5-minute mark?

02:59:37 - 5 min = 02:55:00

02:59:37 + 5 min = 03:05:00

interval_string = '02:55:00-03:05:00'

pd.merge(df1, df2, how='left', on=['employee_id', 'session_id', 'date', 'interval_string'])

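Does anyone know how to round times like this? It seems like it could work: you still match on date, employee, and session, and then look for times that fall in essentially the same 10-minute interval or range.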
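
The rounding asked about above can be sketched with the `.dt.round` accessor on a datetime column. This is a minimal sketch rather than the asker's final code: the one-row sample frame is hypothetical, and the `-` separator and `%H:%M:%S` format are assumptions chosen to reproduce the `'02:55:00-03:05:00'` example.

```python
import pandas as pd

# Hypothetical one-row sample; the column name follows the question.
df1 = pd.DataFrame({"start_time": pd.to_datetime(["2016-01-01 02:59:37"])})

five_min = pd.Timedelta(minutes=5)

# Shift by 5 minutes in each direction, then round to the nearest 5-minute mark.
df1["low_time"] = (df1["start_time"] - five_min).dt.round("5min")
df1["high_time"] = (df1["start_time"] + five_min).dt.round("5min")

# Build the interval string from the time-of-day parts (separator is an assumption).
df1["interval_string"] = (
    df1["low_time"].dt.strftime("%H:%M:%S")
    + "-"
    + df1["high_time"].dt.strftime("%H:%M:%S")
)

print(df1["interval_string"].iloc[0])
```

`.dt.floor("5min")` and `.dt.ceil("5min")` are available as well if rounding down or up is preferred over rounding to the nearest mark.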

2 Answers

  • 2

    Consider the following miniature version of the problem:

    from io import StringIO
    from pandas import read_csv, to_datetime
    
    # how close do sessions have to be to be considered equal? (in minutes)
    threshold = 5
    
    # datetime column (combination of date + start_time)
    dtc = [['date', 'start_time']]
    
    # index column (above combination)
    ixc = 'date_start_time'
    
    df1 = read_csv(StringIO(u'''
    date,start_time,employee_id,session_id
    01/01/2016,02:03:00,7261824,871631182
    01/01/2016,06:03:00,7261824,871631183
    01/01/2016,11:01:00,7261824,871631184
    01/01/2016,14:01:00,7261824,871631185
    '''), parse_dates=dtc)
    
    df2 = read_csv(StringIO(u'''
    date,start_time,employee_id,session_id
    01/01/2016,02:03:00,7261824,871631182
    01/01/2016,06:05:00,7261824,871631183
    01/01/2016,11:04:00,7261824,871631184
    01/01/2016,14:10:00,7261824,871631185
    '''), parse_dates=dtc)
    

    This gives

    >>> df1
          date_start_time  employee_id  session_id
    0 2016-01-01 02:03:00      7261824   871631182
    1 2016-01-01 06:03:00      7261824   871631183
    2 2016-01-01 11:01:00      7261824   871631184
    3 2016-01-01 14:01:00      7261824   871631185
    >>> df2
          date_start_time  employee_id  session_id
    0 2016-01-01 02:03:00      7261824   871631182
    1 2016-01-01 06:05:00      7261824   871631183
    2 2016-01-01 11:04:00      7261824   871631184
    3 2016-01-01 14:10:00      7261824   871631185
    

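    When merging, you want df2[0:3] to be treated as duplicates of df1[0:3] (since each pair is less than 5 minutes apart), but df1[3] and df2[3] to be treated as separate sessions.

    Solution 1: Interval matching

    This is basically what you suggested in your edit. You want to map the timestamps in both tables to a 10-minute interval centered on the timestamp rounded to the nearest 5 minutes.

    Each interval can be uniquely represented by its midpoint, so you can merge the dataframes on the timestamp rounded to the nearest 5 minutes. For example: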

    import numpy as np
    
    # rounding step in nanoseconds (the threshold, i.e. half of each 10-minute interval)
    threshold_ns = threshold * 60 * 1e9
    
    # compute "interval" to which each session belongs
    df1['interval'] = to_datetime(np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
    df2['interval'] = to_datetime(np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
    
    # join
    cols = ['interval', 'employee_id', 'session_id']
    print(df1.merge(df2, on=cols, how='outer')[cols])
    

    prints

                 interval  employee_id  session_id
    0 2016-01-01 02:05:00      7261824   871631182
    1 2016-01-01 06:05:00      7261824   871631183
    2 2016-01-01 11:00:00      7261824   871631184
    3 2016-01-01 14:00:00      7261824   871631185
    4 2016-01-01 11:05:00      7261824   871631184
    5 2016-01-01 14:10:00      7261824   871631185
    

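    Note that this is not entirely correct. Sessions df1[2] and df2[2] are not treated as duplicates, even though they are only 3 minutes apart. This is because they fall on different sides of an interval boundary.

    Solution 2: One-to-one matching

    Here is another approach, which relies on the condition that each session in df1 has zero or one duplicate in df2.

    We replace each timestamp in df1 with the closest timestamp in df2 that matches on employee_id and session_id and is less than 5 minutes away.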

    from datetime import timedelta
    
    # get closest match from "df2" to row from "df1" (as long as it's below the threshold)
    def closest(row):
        matches = df2.loc[(df2.employee_id == row.employee_id) &
                          (df2.session_id == row.session_id)]
    
        # absolute gap, so that idxmin picks the closest candidate on either side
        deltas = (matches.date_start_time - row.date_start_time).abs()
        deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]
    
        try:
            return matches.loc[deltas.idxmin()]
        except ValueError:  # no items
            return row
    
    # replace timestamps in "df1" with closest timestamps in "df2"
    df1 = df1.apply(closest, axis=1)
    
    # join
    cols = ['date_start_time', 'employee_id', 'session_id']
    print(df1.merge(df2, on=cols, how='outer')[cols])
    

    prints

          date_start_time  employee_id  session_id
    0 2016-01-01 02:03:00      7261824   871631182
    1 2016-01-01 06:05:00      7261824   871631183
    2 2016-01-01 11:04:00      7261824   871631184
    3 2016-01-01 14:01:00      7261824   871631185
    4 2016-01-01 14:10:00      7261824   871631185
    

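    This approach is considerably slower, because you have to search all of df2 for every row in df1. What I have written could probably be optimized further, but it will still take a long time on large datasets.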
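
One possible way to avoid the row-by-row search in Solution 2 is to join on the exact keys first and filter on the time gap afterwards. This is a sketch of a different technique (merge, then filter), not the answer's own code; the `_1`/`_2` suffixes and the variable names `pairs` and `close` are assumptions made for illustration, and the sample data mirrors the miniature example above.

```python
import pandas as pd

threshold = pd.Timedelta(minutes=5)
sessions = [871631182, 871631183, 871631184, 871631185]

df1 = pd.DataFrame({
    "date_start_time": pd.to_datetime([
        "2016-01-01 02:03:00", "2016-01-01 06:03:00",
        "2016-01-01 11:01:00", "2016-01-01 14:01:00"]),
    "employee_id": [7261824] * 4,
    "session_id": sessions,
})
df2 = pd.DataFrame({
    "date_start_time": pd.to_datetime([
        "2016-01-01 02:03:00", "2016-01-01 06:05:00",
        "2016-01-01 11:04:00", "2016-01-01 14:10:00"]),
    "employee_id": [7261824] * 4,
    "session_id": sessions,
})

# Pair every df1 row with every df2 row that shares the same keys.
pairs = df1.merge(df2, on=["employee_id", "session_id"], suffixes=("_1", "_2"))

# Keep only the pairs whose timestamps are within the threshold of each other.
gap = (pairs["date_start_time_2"] - pairs["date_start_time_1"]).abs()
close = pairs[gap <= threshold]

print(close)
```

The intermediate `pairs` frame grows with the number of rows that share the same keys, so this trades memory for speed and may not suit data where one employee has very many identical sessions.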

  • 3

    I would try this method in pandas:

    pandas.merge_asof()

    The parameters of interest are direction, tolerance, left_on, and right_on.

    Building on @Igor's answer:

    import pandas as pd
    from pandas import read_csv
    from io import StringIO
    
    # datetime column (combination of date + start_time)
    dtc = [['date', 'start_time']]
    
    # index column (above combination)
    ixc = 'date_start_time'
    
    df1 = read_csv(StringIO(u'''
    date,start_time,employee_id,session_id
    01/01/2016,02:03:00,7261824,871631182
    01/01/2016,06:03:00,7261824,871631183
    01/01/2016,11:01:00,7261824,871631184
    01/01/2016,14:01:00,7261824,871631185
    '''), parse_dates=dtc)
    
    df2 = read_csv(StringIO(u'''
    date,start_time,employee_id,session_id
    01/01/2016,02:03:00,7261824,871631182
    01/01/2016,06:05:00,7261824,871631183
    01/01/2016,11:04:00,7261824,871631184
    01/01/2016,14:10:00,7261824,871631185
    '''), parse_dates=dtc)
    
    
    
    df1['date_start_time'] = pd.to_datetime(df1['date_start_time'])
    df2['date_start_time'] = pd.to_datetime(df2['date_start_time'])
    
    # converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
    df1.index = df1['date_start_time']
    df2.index = df2['date_start_time']
    # the magic happens below, check the direction and tolerance arguments
    tol = pd.Timedelta('5 minute')
    pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)
    

    Output

    date_start_time      date_start_time_x    employee_id_x  session_id_x  date_start_time_y    employee_id_y  session_id_y
    2016-01-01 02:03:00  2016-01-01 02:03:00  7261824        871631182     2016-01-01 02:03:00  7261824.0      871631182.0
    2016-01-01 06:03:00  2016-01-01 06:03:00  7261824        871631183     2016-01-01 06:05:00  7261824.0      871631183.0
    2016-01-01 11:01:00  2016-01-01 11:01:00  7261824        871631184     2016-01-01 11:04:00  7261824.0      871631184.0
    2016-01-01 14:01:00  2016-01-01 14:01:00  7261824        871631185     NaT                  NaN            NaN
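
The call above matches on time alone. Since the question also requires employee_id and session_id to match, and says the second timestamp is always the later one, a variant using the `by` parameter and `direction='forward'` may be closer to what was asked. This is a sketch under those assumptions; the `matched_time` column is a hypothetical helper added so the right-hand timestamp stays visible after the merge.

```python
import pandas as pd

sessions = [871631182, 871631183, 871631184, 871631185]

df1 = pd.DataFrame({
    "date_start_time": pd.to_datetime([
        "2016-01-01 02:03:00", "2016-01-01 06:03:00",
        "2016-01-01 11:01:00", "2016-01-01 14:01:00"]),
    "employee_id": [7261824] * 4,
    "session_id": sessions,
})
df2 = pd.DataFrame({
    "date_start_time": pd.to_datetime([
        "2016-01-01 02:03:00", "2016-01-01 06:05:00",
        "2016-01-01 11:04:00", "2016-01-01 14:10:00"]),
    "employee_id": [7261824] * 4,
    "session_id": sessions,
})

# merge_asof requires both frames to be sorted by the "on" column.
df1 = df1.sort_values("date_start_time")
df2 = df2.sort_values("date_start_time")

# Hypothetical helper column: a copy of the right-hand timestamp.
df2 = df2.assign(matched_time=df2["date_start_time"])

merged = pd.merge_asof(
    df1, df2,
    on="date_start_time",
    by=["employee_id", "session_id"],   # keys that must match exactly
    direction="forward",                # only look at equal or later timestamps
    tolerance=pd.Timedelta(minutes=5),  # maximum allowed gap
)

print(merged)
```

Rows of df1 without a partner inside the tolerance keep their own values and get `NaT` in `matched_time`, which makes unmatched sessions easy to filter out afterwards.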
    
