
Converting a Pandas GroupBy object to a DataFrame


I'm starting with input data like this:

import pandas

df1 = pandas.DataFrame( { 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
    "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"] } )

When printed, it looks like this:

City     Name
0   Seattle    Alice
1   Seattle      Bob
2  Portland  Mallory
3   Seattle  Mallory
4   Seattle      Bob
5  Portland  Mallory

Grouping is simple enough:

g1 = df1.groupby( [ "Name", "City"] ).count()

and printing yields a GroupBy object:

City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
        Seattle      1     1

But what I eventually want is another DataFrame object that contains all the rows in the GroupBy object. In other words, I want the following result:

City  Name
Name    City
Alice   Seattle      1     1
Bob     Seattle      2     2
Mallory Portland     2     2
Mallory Seattle      1     1

I can't quite see how to accomplish this in the pandas documentation. Any hints would be welcome.

8 Answers

  • 89

    Simply put, this should do the job:

    import pandas as pd
    
    grouped_df = df1.groupby( [ "Name", "City"] )
    
    pd.DataFrame(grouped_df.size().reset_index(name = "Group_Count"))
    

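    Here, grouped_df.size() returns the number of rows in each unique (Name, City) group, and reset_index(name=...) moves the group keys back into ordinary columns and names the count column. Finally, pd.DataFrame() is called to create the DataFrame object (reset_index() already returns a DataFrame, so this last call is optional).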
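
    As a minimal sketch of the steps described above, using the question's sample data and skipping the optional pd.DataFrame() wrapper (the variable name counts is only illustrative):

```python
import pandas as pd

# Same sample data as in the question.
df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
})

# size() returns a Series indexed by (Name, City).
# reset_index(name=...) turns the index levels into columns
# and gives the values column the requested name.
counts = df1.groupby(["Name", "City"]).size().reset_index(name="Group_Count")

print(type(counts).__name__)
print(counts)
```

    The result is a flat DataFrame with the columns Name, City and Group_Count, one row per group.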

  • 0

    I found this worked for me.

    import numpy as np
    import pandas as pd
    
    df1 = pd.DataFrame({ 
        "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
        "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
    
    df1['City_count'] = 1
    df1['Name_count'] = 1
    
    df1.groupby(['Name', 'City'], as_index=False).count()
    
  • 5

    I want to slightly change the answer given by Wes, because version 0.16.2 requires as_index=False. If you don't set it, you get an empty DataFrame.

    Source

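    Aggregation functions will not return the groups that you are aggregating over if they are named columns when as_index=True, the default. The grouped columns will be the indices of the returned object. Passing as_index=False will return the groups that you are aggregating over, if they are named columns. Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do, for example, DataFrame.sum() and get back a Series. nth can act as a reducer or a filter, see here.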
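
    To make the quoted behaviour concrete, here is a small sketch comparing the two settings. The Visits column and its values are an assumption added only so that there is something to aggregate; they are not part of the question's data.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
    "Visits": [1, 2, 3, 4, 5, 6],  # hypothetical numeric column
})

# Default (as_index=True): the group keys become a MultiIndex.
indexed = df1.groupby(["Name", "City"]).sum()

# as_index=False: the group keys stay as ordinary columns.
flat = df1.groupby(["Name", "City"], as_index=False).sum()

print(indexed)
print(flat)
```

    Calling indexed.reset_index() should give the same layout as flat, which is why both approaches appear in the answers here.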

    import pandas as pd
    
    df1 = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
                        "City":["Seattle","Seattle","Portland","Seattle","Seattle","Portland"]})
    print(df1)
    #
    #       City     Name
    #0   Seattle    Alice
    #1   Seattle      Bob
    #2  Portland  Mallory
    #3   Seattle  Mallory
    #4   Seattle      Bob
    #5  Portland  Mallory
    #
    g1 = df1.groupby(["Name", "City"], as_index=False).count()
    print(g1)
    #
    #                  City  Name
    #Name    City
    #Alice   Seattle      1     1
    #Bob     Seattle      2     2
    #Mallory Portland     2     2
    #        Seattle      1     1
    #
    

    EDIT:

    In version 0.17.1 and later you can select a subset of columns before calling count, and use reset_index with the parameter name after size:

    print(df1.groupby(["Name", "City"], as_index=False).count())
    #IndexError: list index out of range
    
    print(df1.groupby(["Name", "City"]).count())
    #Empty DataFrame
    #Columns: []
    #Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]
    
    print(df1.groupby(["Name", "City"])[['Name','City']].count())
    #                  Name  City
    #Name    City                
    #Alice   Seattle      1     1
    #Bob     Seattle      2     2
    #Mallory Portland     2     2
    #        Seattle      1     1
    
    print(df1.groupby(["Name", "City"]).size().reset_index(name='count'))
    #      Name      City  count
    #0    Alice   Seattle      1
    #1      Bob   Seattle      2
    #2  Mallory  Portland      2
    #3  Mallory   Seattle      1
    

    The difference between count and size is that size counts NaN values, while count does not.
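
    A short sketch of that difference, using a made-up Score column with one missing value (the data is hypothetical, chosen only to show the effect):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Alice", "Bob"],
    "Score": [1.0, np.nan, 3.0],  # hypothetical column containing a NaN
})

grouped = df.groupby("Name")["Score"]

sizes = grouped.size()    # rows per group, NaN included
counts = grouped.count()  # non-NaN values per group

print(sizes)
print(counts)
```

    For Alice, size reports 2 rows while count reports 1, because the NaN score is skipped by count.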

  • 5

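    These solutions only partially worked for me because I was doing multiple aggregations. Here is a sample output of my groupby that I wanted to convert to a DataFrame: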

    Groupby Output

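    Because I wanted more than the count provided by reset_index(), I wrote a manual method for converting the image above into a DataFrame. I understand this is not the most Pythonic or pandas-like way of doing it, as it is quite verbose and explicit, but it was all I needed. Basically, use the reset_index() method explained above to start a "scaffolding" DataFrame, then loop through the group pairings in the grouped DataFrame, retrieve the indices, perform your calculations against the ungrouped DataFrame, and set the values in the new aggregated DataFrame.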

    df_grouped = df[['Salary Basis', 'Job Title', 'Hourly Rate', 'Male Count', 'Female Count']]
    df_grouped = df_grouped.groupby(['Salary Basis', 'Job Title'], as_index=False)
    
    # Grouped gives us the indices we want for each grouping
    # We cannot convert a groupedby object back to a dataframe, so we need to do it manually
    # Create a new dataframe to work against
    df_aggregated = df_grouped.size().to_frame('Total Count').reset_index()
    df_aggregated['Male Count'] = 0
    df_aggregated['Female Count'] = 0
    df_aggregated['Job Rate'] = 0
    
    def manualAggregations(indices_array):
        temp_df = df.iloc[indices_array]
        return {
            'Male Count': temp_df['Male Count'].sum(),
            'Female Count': temp_df['Female Count'].sum(),
            'Job Rate': temp_df['Hourly Rate'].max()
        }
    
    for name, group in df_grouped:
        ix = df_grouped.indices[name]
        calcDict = manualAggregations(ix)
    
        for key in calcDict:
            #Salary Basis, Job Title
            columns = list(name)
            df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) & 
                              (df_aggregated['Job Title'] == columns[1]), key] = calcDict[key]
    

    If a dictionary isn't your thing, the calculations can be applied inline in the for loop:

    df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) & 
                      (df_aggregated['Job Title'] == columns[1]), 'Male Count'] = df['Male Count'].iloc[ix].sum()
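
    For comparison, the same multi-aggregation result can probably be reached without a loop by using named aggregation, which I believe requires pandas 0.25 or later. This is a sketch rather than the answer author's method: the column names are taken from the code above, but the sample rows are invented for illustration.

```python
import pandas as pd

# Hypothetical data; only the column names come from the answer above.
df = pd.DataFrame({
    "Salary Basis": ["Hourly", "Hourly", "Salaried", "Hourly"],
    "Job Title": ["Clerk", "Clerk", "Manager", "Driver"],
    "Hourly Rate": [15.0, 17.5, 40.0, 20.0],
    "Male Count": [1, 0, 1, 1],
    "Female Count": [0, 1, 0, 0],
})

# Each output column is defined as (input column, aggregation).
# The ** unpacking is needed because the output names contain spaces.
df_aggregated = (
    df.groupby(["Salary Basis", "Job Title"])
      .agg(**{
          "Total Count": ("Hourly Rate", "size"),
          "Male Count": ("Male Count", "sum"),
          "Female Count": ("Female Count", "sum"),
          "Job Rate": ("Hourly Rate", "max"),
      })
      .reset_index()  # move the group keys back into columns
)

print(df_aggregated)
```

    The trade-off is that the loop version can run arbitrary per-group logic, while agg is limited to per-column reductions but is much shorter.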
    
  • 10

    The following solution may be simpler:

    df1.reset_index().groupby( [ "Name", "City"],as_index=False ).count()
    
  • 3

    I aggregated the data by quantity and stored the result in a DataFrame:

    almo_grp_data = pd.DataFrame({'Qty_cnt' :
    almo_slt_models_data.groupby( ['orderDate','Item','State Abv']
              )['Qty'].sum()}).reset_index()
    
  • 386

    g1 here is a DataFrame. It has a hierarchical index, though:

    In [19]: type(g1)
    Out[19]: pandas.core.frame.DataFrame
    
    In [20]: g1.index
    Out[20]: 
    MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
           ('Mallory', 'Seattle')], dtype=object)
    

    Perhaps you want something like this?

    In [21]: g1.add_suffix('_Count').reset_index()
    Out[21]: 
          Name      City  City_Count  Name_Count
    0    Alice   Seattle           1           1
    1      Bob   Seattle           2           2
    2  Mallory  Portland           2           2
    3  Mallory   Seattle           1           1
    

    Or something like:

    In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
    Out[36]: 
          Name      City  count
    0    Alice   Seattle      1
    1      Bob   Seattle      2
    2  Mallory  Portland      2
    3  Mallory   Seattle      1
    
  • 1

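    Maybe I misunderstand the question, but if you want to convert the groupby result back to a DataFrame you can use .to_frame(). I wanted to reset the index when I did this, so I included that part as well.

    Example code, unrelated to the question: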

    df = df['TIME'].groupby(df['Name']).min()
    df = df.to_frame()
    df = df.reset_index()  # 'Name' is the only index level; 'TIME' is already a column
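
    A self-contained version of this pattern might look as follows. The Name and TIME column names come from the snippet above, while the values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "TIME": [3, 5, 1, 4],  # hypothetical values
})

# Grouping a single column yields a Series indexed by Name.
earliest = df["TIME"].groupby(df["Name"]).min()

# to_frame() wraps the Series in a one-column DataFrame;
# reset_index() then turns the Name index into a regular column.
result = earliest.to_frame().reset_index()

print(result)
```

    Note that to_frame() is a Series method, so it applies to the aggregated result rather than to the GroupBy object itself.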
    
