Counting rows per day

I have a DataFrame with the columns created_at and entities, as shown below:

created_at                         entities
2017-10-29 23:06:28     {'hashtags': [{'text': 'OPEC', 'indices': [0, ...
2017-10-29 22:28:20     {'hashtags': [{'text': 'Iraq', 'indices': [21,...
2017-10-29 20:01:37     {'hashtags': [{'text': 'oil', 'indices': [58, ...
2017-10-29 20:00:14     {'hashtags': [{'text': 'oil', 'indices': [38, ...
2017-10-27 08:44:30     {'hashtags': [{'text': 'Iran', 'indices': [19,...
2017-10-27 08:44:10     {'hashtags': [{'text': 'Oil', 'indices': [17, ...
2017-10-27 08:43:13     {'hashtags': [{'text': 'Oil', 'indices': [0, 4...
2017-10-27 08:43:00     {'hashtags': [{'text': 'Iran', 'indices': [19,.

I want to count the number of entities for each day. Basically, I would like to get something like this:

created_at    number_of_entities
2017-10-29           4
2017-10-27           4

How can I do this? I am using pandas 0.23.4.

5 Answers

  • 2

    You can use floor or date to strip the time component, count with value_counts, and then use rename_axis and reset_index to get a two-column DataFrame:

    df = (df['created_at'].dt.floor('D')
                         .value_counts()
                         .rename_axis('created_at')
                         .reset_index(name='number_of_entities'))
    print(df)
      created_at  number_of_entities
    0 2017-10-29                   4
    1 2017-10-27                   4
    

    Or:

    df = (df['created_at'].dt.date
                         .value_counts()
                         .rename_axis('created_at')
                         .reset_index(name='number_of_entities'))
    

    To avoid the default sorting done by value_counts, pass sort=False:

    df = (df['created_at'].dt.floor('D')
                         .value_counts(sort=False)
                         .rename_axis('created_at')
                         .reset_index(name='number_of_entities'))
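
Note that value_counts orders its result by count by default, not by date. If a chronological result is wanted, one option is to add sort_index to the chain. The following is only a sketch of that idea; the sample data is hypothetical (it reuses timestamps from the question with one row dropped so that the two days have different counts), and the variable name counts is not from the original answer.

```python
import pandas as pd

# Hypothetical sample data: four rows on 2017-10-29, three on 2017-10-27.
df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2017-10-29 23:06:28", "2017-10-29 22:28:20",
        "2017-10-29 20:01:37", "2017-10-29 20:00:14",
        "2017-10-27 08:44:30", "2017-10-27 08:44:10",
        "2017-10-27 08:43:13",
    ])
})

counts = (
    df["created_at"].dt.floor("D")   # drop the time component
      .value_counts()                # count rows per day (ordered by count)
      .sort_index()                  # reorder by date instead
      .rename_axis("created_at")
      .reset_index(name="number_of_entities")
)
print(counts)
```

Without sort_index, the day with the most rows would come first regardless of its date.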
    
  • 2

    Use groupby.size:

    # Convert to datetime dtype if you haven't.
    df1.created_at = pd.to_datetime(df1.created_at)
    
    df2 = df1.groupby(df1.created_at.dt.date).size().reset_index(name='number_of_entities')
    
    print(df2)
    
       created_at  number_of_entities
    0  2017-10-27                   4
    1  2017-10-29                   4
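
One detail worth knowing about the groupby approach above: .dt.date produces plain Python date objects, so the resulting created_at column has object dtype. If a datetime64 column is preferred, grouping by .dt.normalize() (which sets the time to midnight) is one possible alternative. This is a sketch under assumptions: the sample rows and the names by_date and by_midnight are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: timestamps stored as strings, placeholder entities.
df1 = pd.DataFrame({
    "created_at": [
        "2017-10-29 23:06:28", "2017-10-29 22:28:20",
        "2017-10-27 08:44:30", "2017-10-27 08:44:10", "2017-10-27 08:43:13",
    ],
    "entities": ["a", "b", "c", "d", "e"],
})

# Convert to datetime dtype so that the .dt accessor is available.
df1["created_at"] = pd.to_datetime(df1["created_at"])

# Group by Python date objects (the result column has object dtype).
by_date = (
    df1.groupby(df1["created_at"].dt.date)
       .size()
       .reset_index(name="number_of_entities")
)

# Group by timestamps normalized to midnight (the result column stays datetime64).
by_midnight = (
    df1.groupby(df1["created_at"].dt.normalize())
       .size()
       .reset_index(name="number_of_entities")
)

print(by_date.dtypes)
print(by_midnight.dtypes)
```

Both variants give the same counts; they differ only in the dtype of the date column.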
    
  • 3

    Given:

    >>> df
               created_at  entities
    0 2017-10-29 23:06:28         1
    1 2017-10-29 22:28:20         2
    2 2017-10-29 20:01:37         3
    3 2017-10-29 20:00:14         4
    4 2017-10-27 08:44:30         5
    5 2017-10-27 08:44:10         6
    6 2017-10-27 08:43:13         7
    7 2017-10-27 08:43:00         8
    

    >>> df.dtypes
    created_at    datetime64[ns]
    entities               int64
    dtype: object
    

    you can issue:

    >>> pd.PeriodIndex(df['created_at'], freq='D').value_counts()
    2017-10-29    4
    2017-10-27    4
    Freq: D, Name: created_at, dtype: int64
    

    In the comments, jezrael suggested a better way that avoids the PeriodIndex constructor:

    >>> df['created_at'].dt.to_period('D').value_counts()
    2017-10-27    4
    2017-10-29    4
    

    With a little extra renaming to match your output, it starts to look like jezrael's solution. ;)

    >>> datecol = 'created_at'
    >>> df[datecol].dt.to_period('D').value_counts().rename_axis(datecol).reset_index(name='number_of_entities')
      created_at  number_of_entities
    0 2017-10-27                   4
    1 2017-10-29                   4
    

    Alternatively, you can set the date column as the index and then resample:

    >>> df.set_index('created_at').resample('D').size()
    created_at
    2017-10-27    4
    2017-10-28    0
    2017-10-29    4
    Freq: D, dtype: int64
    

    ...and, if needed, convert to your exact output:

    >>> resampled = df.set_index('created_at').resample('D').size()
    >>> resampled[resampled != 0].reset_index().rename(columns={0: 'number_of_entities'})
      created_at  number_of_entities
    0 2017-10-27                   4
    1 2017-10-29                   4
    

    More context: resample is especially useful for arbitrary time intervals, such as "five minutes". The following example is taken directly from Wes McKinney's book "Python for Data Analysis".

    >>> import numpy as np
    >>> import pandas as pd
    >>> N = 15
    >>> times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
    >>> df = pd.DataFrame({'time': times, 'value': np.arange(N)})
    >>> 
    >>> df
                      time  value
    0  2017-05-20 00:00:00      0
    1  2017-05-20 00:01:00      1
    2  2017-05-20 00:02:00      2
    3  2017-05-20 00:03:00      3
    4  2017-05-20 00:04:00      4
    5  2017-05-20 00:05:00      5
    6  2017-05-20 00:06:00      6
    7  2017-05-20 00:07:00      7
    8  2017-05-20 00:08:00      8
    9  2017-05-20 00:09:00      9
    10 2017-05-20 00:10:00     10
    11 2017-05-20 00:11:00     11
    12 2017-05-20 00:12:00     12
    13 2017-05-20 00:13:00     13
    14 2017-05-20 00:14:00     14
    >>> 
    >>> df.set_index('time').resample('5min').size()
    time
    2017-05-20 00:00:00    5
    2017-05-20 00:05:00    5
    2017-05-20 00:10:00    5
    Freq: 5T, dtype: int64
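
To make the difference between the two approaches in this answer concrete, here is a small side-by-side sketch. It assumes the eight timestamps from the question; the entities values and the names by_period and by_resample are placeholders. The point is that to_period followed by value_counts only reports days that actually occur in the data, while resample also emits the empty day in between with a count of 0.

```python
import pandas as pd

# Timestamps copied from the question; entities is a placeholder column.
df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2017-10-29 23:06:28", "2017-10-29 22:28:20",
        "2017-10-29 20:01:37", "2017-10-29 20:00:14",
        "2017-10-27 08:44:30", "2017-10-27 08:44:10",
        "2017-10-27 08:43:13", "2017-10-27 08:43:00",
    ]),
    "entities": range(1, 9),
})

# Only days present in the data appear in the result.
by_period = df["created_at"].dt.to_period("D").value_counts().sort_index()

# Every day between the first and last timestamp appears, including empty ones.
by_resample = df.set_index("created_at").resample("D").size()

print(by_period)
print(by_resample)
print(by_resample[by_resample != 0])  # drop the empty days again
```

Which behavior is preferable depends on whether days with no rows should show up as explicit zeros.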
    
  • 1

    Given your data:

    In [3]: df
    Out[3]: 
                created_at                                           entities
    0  2017-10-29 23:06:28  {'hashtags': [{'text': 'OPEC', 'indices': [0, ...
    1  2017-10-29 22:28:20  {'hashtags': [{'text': 'Iraq', 'indices': [21,...
    2  2017-10-29 20:01:37  {'hashtags': [{'text': 'oil', 'indices': [58, ...
    3  2017-10-29 20:00:14  {'hashtags': [{'text': 'oil', 'indices': [38, ...
    4  2017-10-27 08:44:30  {'hashtags': [{'text': 'Iran', 'indices': [19,...
    5  2017-10-27 08:44:10  {'hashtags': [{'text': 'Oil', 'indices': [17, ...
    6  2017-10-27 08:43:13  {'hashtags': [{'text': 'Oil', 'indices': [0, 4...
    7  2017-10-27 08:43:00    {'hashtags': [{'text': 'Iran', 'indices': [19,.
    

    You can use groupby(..).count() as follows to get what you want:

    In [4]: df[["created_at"]].groupby(pd.to_datetime(df["created_at"]).dt.date).count().rename(columns={"created_at":"number_of_entities"}).reset_index()
        ...: 
    Out[4]: 
       created_at  number_of_entities
    0  2017-10-27                   4
    1  2017-10-29                   4
    

    Note that if the created_at column already has a datetime dtype, you can simply use:

    df[["created_at"]].groupby(df.created_at.dt.date).count().rename(columns={"created_at":"number_of_entities"}).reset_index()
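
A side note on count versus size, since both appear in this thread: count reports the number of non-null values in each column, whereas size reports the number of rows in each group. The two agree when the counted column has no missing values, as is the case for created_at above. The sketch below uses hypothetical data with one missing entities value to show where they diverge; the names day, rows_per_day and non_null_entities are illustrative only.

```python
import pandas as pd

# Hypothetical data: the second row has no entities value.
df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2017-10-29 23:06:28", "2017-10-29 22:28:20",
        "2017-10-27 08:44:30", "2017-10-27 08:44:10",
    ]),
    "entities": ["a", None, "c", "d"],
})

# Grouping key: the calendar date of each row.
day = df["created_at"].dt.date.rename("day")
grouped = df.groupby(day)

rows_per_day = grouped.size()                    # counts rows
non_null_entities = grouped["entities"].count()  # counts non-null values only

print(rows_per_day)
print(non_null_entities)
```

Here 2017-10-29 has two rows but only one non-null entities value, so the two results differ for that day.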
    
  • 2

    You can group by day with df.groupby(df.created_at.dt.day).

    As for the function that computes the count, we would need to see a complete row; your data structure looks rather odd.
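
One caveat about grouping with .dt.day: it returns the day of the month (1 to 31), so rows from different months or years that share a day number end up in the same group. Grouping by .dt.date keeps full calendar dates apart. The sketch below illustrates this with made-up data that spans two months; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data: two rows on 2017-10-27, one on 2017-11-27, one on 2017-10-29.
df = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2017-10-27 08:44:30", "2017-10-27 08:43:00",
        "2017-11-27 09:00:00",   # same day of month, different month
        "2017-10-29 20:00:14",
    ])
})

# Day of month: October 27 and November 27 are merged into one group.
by_day_of_month = df.groupby(df["created_at"].dt.day).size()

# Full calendar date: each date gets its own group.
by_calendar_date = df.groupby(df["created_at"].dt.date).size()

print(by_day_of_month)
print(by_calendar_date)
```

For data confined to a single month the two groupings give the same counts, which is why the difference is easy to miss.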
