Creating a PySpark DataFrame column that coalesces two other columns: why do I get the error 'unicode' object has no attribute 'isNull'?

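I'm having some trouble with a PySpark DataFrame. Specifically, I'm trying to create a column for a DataFrame that is the result of coalescing two of the DataFrame's columns.

For example: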

this_dataframe = this_dataframe.withColumn('new_max_price', coalesce(this_dataframe['max_price'],this_dataframe['avg(max_price)']).cast(FloatType()))

The problem with this code is that it still returns null in some rows. Specifically, I'm running this check:

this_dataset.where(col("new_max_price").isNull()).count()

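This returns a positive count. So although the code runs, it does not produce the expected result.

I found some other questions (such as Selecting values from non-null columns in a PySpark DataFrame) that appear to be similar, but for some reason I can't reproduce their results.

Here is some code based on the question mentioned above: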

from pyspark.sql.functions import col, udf

def coalesce_columns(c1, c2):
    # Return c1 unless it is null (None); otherwise fall back to c2.
    if c1 is not None:
        return c1
    return c2

coalesceUDF = udf(coalesce_columns)
max_price_col = [coalesceUDF(col("max_price"), col("avg(max_price)")).alias("competitive_max_price")]
this_dataset.select(max_price_col).show()

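When I run the last line to check whether my result is correct, I get an error:

AttributeError: 'unicode' object has no attribute 'isNull'

So basically the question is: how do I create a column that coalesces two PySpark DataFrame columns using Spark SQL functions? If that isn't possible, what kind of UDF could I use to create a DataFrame column that I can append to another DataFrame?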
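A note on the error message: a Python UDF receives ordinary Python values for each row (strings, numbers, or None), not Column objects, so Column methods such as isNull are not available inside it. The snippet posted above does not call isNull itself, so the following is only a plausible reconstruction of how such a message could arise. It runs without Spark, and the name broken_coalesce is hypothetical.

```python
def coalesce_columns(c1, c2):
    """Plain-Python body suitable for a UDF: it works on values, not Columns."""
    return c2 if c1 is None else c1


def broken_coalesce(c1, c2):
    """Hypothetical mistake: isNull() is a Column method, not a str method."""
    return c2 if c1.isNull() else c1


# The correct body handles every combination of value and None.
results = [
    coalesce_columns("3.07", "3.05"),
    coalesce_columns(None, "3.06"),
    coalesce_columns(None, None),
]

# The broken body fails as soon as it is handed a plain string.
try:
    broken_coalesce("3.07", "3.05")
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)

print(results)
print(error_message)
```

On Python 2 the string type reported in the message would be 'unicode', which matches the error quoted above; on Python 3 it is 'str'.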

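1 Answer

I think coalesce is actually doing its job. The root of the problem is that some rows have null in both columns, which yields null after coalescing. Here is an example that may help.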

from pyspark.sql import Row
from pyspark.sql.types import FloatType
from pyspark.sql.functions import coalesce, col

data = [Row(a="3.07",b="3.05"),
        Row(a="3.06",b="3.06"),
        Row(a="3.09",b=None),
        Row(a=None,b=None),
        Row(a=None,b="3.06"),
        Row(a=None,b=None)
       ]

df = sqlContext.createDataFrame(data)

tmp = df.withColumn('c', coalesce(df['a'],df['b']).cast(FloatType()))

tmp.where(col("c").isNotNull()).show()


+----+----+----+
|   a|   b|   c|
+----+----+----+
|3.07|3.05|3.07|
|3.06|3.06|3.06|
|3.09|null|3.09|
|null|3.06|3.06|
+----+----+----+
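The behaviour shown in the table can be modelled without Spark. The sketch below is a plain-Python stand-in for coalesce (first non-null argument wins), applied to the same six rows, with None playing the role of null. The helper names and the "0.0" fallback are assumptions made for illustration; whether a constant default is appropriate depends on the data. In PySpark, a constant would have to be wrapped with lit before being passed as the last argument to coalesce.

```python
from typing import Optional

# The six (a, b) pairs from the example above; None stands in for SQL null.
rows = [
    ("3.07", "3.05"),
    ("3.06", "3.06"),
    ("3.09", None),
    (None, None),
    (None, "3.06"),
    (None, None),
]


def coalesce_values(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


def to_float(value: Optional[str]) -> Optional[float]:
    """Mimic a null-preserving cast: None stays None."""
    return None if value is None else float(value)


# Two-argument coalesce: rows where both inputs are None stay None.
c = [to_float(coalesce_values(a, b)) for a, b in rows]
still_null = sum(1 for value in c if value is None)

# Hypothetical fallback: a constant last argument is never None.
c_with_default = [to_float(coalesce_values(a, b, "0.0")) for a, b in rows]

print(c)
print(still_null)
print(c_with_default)
```

The two rows where both a and b are None are exactly the ones that remain null, which is why a null count on the coalesced column can still be positive.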

Answered 2024-04-27T12:44:51+08:00
