PySpark: Column is not iterable

When I try to groupBy and get the max on this DataFrame, I get "Column is not iterable":

linesWithSparkDF
+---+-----+
| id|cycle|
+---+-----+
| 31|   26|
| 31|   28|
| 31|   29|
| 31|   97|
| 31|   98|
| 31|  100|
| 31|  101|
| 31|  111|
| 31|  112|
| 31|  113|
+---+-----+
only showing top 10 rows


<ipython-input-41-373452512490> in runlgmodel2(model, data)
     65     linesWithSparkDF.show(10)
     66 
---> 67     linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(max(col("cycle")))
     68     print "linesWithSparkGDF"
     69 

/usr/hdp/current/spark-client/python/pyspark/sql/column.py in __iter__(self)
    241 
    242     def __iter__(self):
--> 243         raise TypeError("Column is not iterable")
    244 
    245     # string methods

TypeError: Column is not iterable

1 Answer

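It's because you've overridden the `max` definition provided by apache-spark (or never imported it): the `max` being called here is Python's built-in `max`, not `pyspark.sql.functions.max`. It was easy to spot, because the built-in `max` expects an iterable, and a `Column` is not one.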
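
The mechanism can be reproduced without Spark. The following is only a minimal sketch: `FakeColumn` is a hypothetical stand-in written for illustration, not the real `pyspark.sql.Column`, and it assumes nothing beyond the `__iter__` behavior visible in the traceback above.

```python
class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column (illustration only)."""

    def __init__(self, name):
        self.name = name

    def __iter__(self):
        # Mirrors the __iter__ shown in the traceback: iteration is refused.
        raise TypeError("Column is not iterable")


try:
    # With a single argument, Python's built-in max() tries to iterate over it.
    max(FakeColumn("cycle"))
except TypeError as exc:
    message = str(exc)

print(message)  # prints: Column is not iterable
```

Because the built-in `max` receives one argument, it treats that argument as a collection to iterate over, which is exactly the call that fails.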

To fix this, you can use a different syntax, and it should work:
```
linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg({"cycle": "max"})
```
or
```
from pyspark.sql.functions import max as sparkMax

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(sparkMax(col("cycle")))
```
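
If it is unclear which `max` a session is actually using, a check along these lines may help. It is a sketch only: `which_max` is a hypothetical helper name introduced here, and it relies solely on Python 3's standard `builtins` module.

```python
import builtins


def which_max(func):
    """Report whether func is Python's built-in max or some other callable."""
    if func is builtins.max:
        return "built-in max"
    # Any other object: show the module it claims to come from, if any.
    return "not the built-in: " + str(getattr(func, "__module__", None))


# In a fresh session nothing has rebound the name, so this is the built-in.
print(which_max(max))
```

If this reports the built-in, then `agg(max(col("cycle")))` will raise the error above, and one of the two fixes applies.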
Answered 2024-04-26T22:40:11+08:00
