首页 文章

Pyspark groupBy Pivot Transformation

提问于
浏览
0

我很难构建以下Pyspark数据帧操作 .

基本上我试图按类别分组,然后转动/取消融合子类别并添加新列 .

enter image description here

我已经尝试了很多方法,但它们非常慢并且没有利用Spark的并行性 .

这是我现有的(慢速,详细)代码:

from pyspark.sql.functions import lit

df = sqlContext.table('Table')

#loop over category
listids = [x.asDict().values()[0] for x in df.select("category").distinct().collect()]
dfArray = [df.where(df.category == x) for x in listids]
for d in dfArray:
  #loop over subcategory
  listids_sub = [x.asDict().values()[0] for x in d.select("sub_category").distinct().collect()]
  dfArraySub = [d.where(d.sub_category == x) for x in listids_sub]
  num = 1

  for b in dfArraySub:
    #renames all columns to append a number
    for c in b.columns:
      if c not in ['category','sub_category','date']:
        column_name = str(c)+'_'+str(num)
        b = b.withColumnRenamed(str(c), str(c)+'_'+str(num))
        b = b.drop('sub_category')
    num += 1
    #if no df exists, create one and continually join new columns
    try:
      all_subs = all_subs.drop('sub_category').join(b.drop('sub_category'), on=['cateogry','date'], how='left')
    except:
      all_subs = b

  #Fixes missing columns on union
  try:
    try:
      diff_columns = list(set(all_cats.columns) - set(all_subs.columns))
      for d in diff_columns:
        all_subs = all_subs.withColumn(d, lit(None))
      all_cats = all_cats.union(all_subs)
    except:
      diff_columns = list(set(all_subs.columns) - set(all_cats.columns))
      for d in diff_columns:
        all_cats = all_cats.withColumn(d, lit(None))
      all_cats = all_cats.union(all_subs)

  except Exception as e:
    print e
    all_cats = all_subs

但这很慢 . 任何指导将不胜感激!

1 回答

  • 1

    您的输出不是很合理,但我们可以使用pivot函数实现此结果 . 你需要确定你的规则,否则我可以看到它可能失败的很多情况 .

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window
    
    df.show()
    
    +----------+---------+------------+------------+------------+
    |      date| category|sub_category|metric_sales|metric_trans|
    +----------+---------+------------+------------+------------+
    |2018-01-01|furniture|         bed|         100|          75|
    |2018-01-01|furniture|       chair|         110|          85|
    |2018-01-01|furniture|       shelf|          35|          30|
    |2018-02-01|furniture|         bed|          55|          50|
    |2018-02-01|furniture|       chair|          45|          40|
    |2018-02-01|furniture|       shelf|          10|          15|
    |2018-01-01|      rug|      circle|           2|           5|
    |2018-01-01|      rug|      square|           3|           6|
    |2018-02-01|      rug|      circle|           3|           3|
    |2018-02-01|      rug|      square|           4|           5|
    +----------+---------+------------+------------+------------+
    
    
    
    df.withColumn("fg", F.row_number().over(Window().partitionBy('date', 'category').orderBy("sub_category"))).groupBy('date', 'category', ).pivot('fg').sum('metric_sales', 'metric_trans').show()
    
    +----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
    |      date| category|1_sum(CAST(`metric_sales` AS BIGINT))|1_sum(CAST(`metric_trans` AS BIGINT))|2_sum(CAST(`metric_sales` AS BIGINT))|2_sum(CAST(`metric_trans` AS BIGINT))|3_sum(CAST(`metric_sales` AS BIGINT))|3_sum(CAST(`metric_trans` AS BIGINT))|
    +----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
    |2018-02-01|      rug|                                    3|                                    3|                                    4|                                    5|                                 null|                                 null|
    |2018-02-01|furniture|                                   55|                                   50|                                   45|                                   40|                                   10|                                   15|
    |2018-01-01|furniture|                                  100|                                   75|                                  110|                                   85|                                   35|                                   30|
    |2018-01-01|      rug|                                    2|                                    5|                                    3|                                    6|                                 null|                                 null|
    +----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
    

相关问题