首页 文章

如何在pyspark中从单行添加多行和多列?

提问于
浏览
1

我是Spark的新手,我有一个需要从单行生成多行和多列的要求 .

Input:

col1   col2  col3  col4

Output

col1 col2   col3  col4 col5 col6 col7 

col1 col2   col3  col4 col8 col9 col10

Logics for new columns:

**col5 :**

if col1==0 and col3!=0:
   col5 = col4/col3

else: 
   col5 = 0


**col6 :**

if col1==0 and col4!=0:
   col6 = (col3*col4)/col1

else: 
   col6 = 0

For first row col7 holds same value as col2

**col8 :**

if col1!=0 and col3!=0:
   col8 = col4/col3

else: 
   col8 = 0
**col9 :**

if col1!=0 and col4!=0:
   col9 = (col3*col4)/col1

else: 
   col9 = 0

For second row col10 = col2+ "_NEW"

最后,“sum”函数需要应用group by . 希望一旦我们转换上述结构就会很容易 .

谷歌的大部分文章都解释了如何使用“withcolumn”选项而不是多列将单列添加到现有数据框 . 这篇文章都没有解释过这个场景 . 所以我想请求您的帮助 .

2 回答

  • 0

    希望这可以帮助!

    from pyspark.sql.functions import col, when, lit, concat, round, sum
    
    #sample data
    df = sc.parallelize([(1, 2, 3, 4), (5, 6, 7, 8)]).toDF(["col1", "col2", "col3", "col4"])
    
    #populate col5, col6, col7
    col5 = when((col('col1') == 0) & (col('col3') != 0), round(col('col4')/ col('col3'), 2)).otherwise(0)
    col6 = when((col('col1') == 0) & (col('col4') != 0), round((col('col3') * col('col4'))/ col('col1'), 2)).otherwise(0)
    col7 = col('col2')
    df1 = df.withColumn("col5", col5).\
        withColumn("col6", col6).\
        withColumn("col7", col7)
    
    #populate col8, col9, col10
    col8 = when((col('col1') != 0) & (col('col3') != 0), round(col('col4')/ col('col3'), 2)).otherwise(0)
    col9 = when((col('col1') != 0) & (col('col4') != 0), round((col('col3') * col('col4'))/ col('col1'), 2)).otherwise(0)
    col10= concat(col('col2'), lit("_NEW"))
    df2 = df.withColumn("col5", col8).\
        withColumn("col6", col9).\
        withColumn("col7", col10)
    
    #final dataframe
    final_df = df1.union(df2)
    final_df.show()
    
    #groupBy calculation
    #final_df.groupBy("col1", "col2", "col3", "col4").agg(sum("col5")).show()
    

    输出是:

    +----+----+----+----+----+----+-----+
    |col1|col2|col3|col4|col5|col6| col7|
    +----+----+----+----+----+----+-----+
    |   1|   2|   3|   4| 0.0| 0.0|    2|
    |   5|   6|   7|   8| 0.0| 0.0|    6|
    |   1|   2|   3|   4|1.33|12.0|2_NEW|
    |   5|   6|   7|   8|1.14|11.2|6_NEW|
    +----+----+----+----+----+----+-----+
    

    如果它解决了你的问题,别忘了告诉我们:)

  • 2

    选择很少:

    • 根据需要多次使用 withColumn (即需要添加多少列)

    • 在数据框上使用 map 来解析列并使用适当的列返回 Row 并在之后创建DataFrame .

相关问题