I basically have a dataframe with two columns. It is currently a Spark DataFrame and looks like this:
recommender_sdf.show(5,truncate=False)
+------+-------------------------+
|TRANS |ITEM |
+------+-------------------------+
|163589|How to Motivate Employees|
|373053|How to Motivate Employees|
|280169|How to Motivate Employees|
|495281|How to Motivate Employees|
|3498 |How to Motivate Employees|
+------+-------------------------+
So there are two columns: the first, TRANS, is a person's ID, and ITEM is the name of the video that person watched.
I want to cross-tabulate this dataset so that each observed item becomes a separate column, with its total count as that column's value.
I first tried Spark's DataFrame crosstab function, but it fails with the following Java heap error:
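To make the target shape concrete, here is a minimal pandas sketch on made-up IDs and titles (the real data lives in Spark; this only illustrates the desired one-column-per-item layout):

```python
import pandas as pd

# Tiny made-up sample mirroring the TRANS/ITEM schema
df = pd.DataFrame({
    'TRANS': [163589, 373053, 163589],
    'ITEM': ['How to Motivate Employees',
             'How to Motivate Employees',
             'Time Management'],
})

# Desired result: one row per TRANS, one column per ITEM, view counts as values
pct = pd.crosstab(df['TRANS'], df['ITEM'])
print(pct)
```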
Py4JJavaError: An error occurred while calling o72.crosstab.
: java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.<init>(rows.scala:252)
at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$4.apply(StatFunctions.scala:123)
at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$4.apply(StatFunctions.scala:122)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:122)
at org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:133)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
I didn't understand this error, so I thought of converting the Spark DataFrame to a pandas DataFrame, using the pandas crosstab function to do the cross-tabulation, and then converting back to a Spark DataFrame.
The pandas crosstab works fine and does cross-tabulate the dataset. But now, when I try to convert the result back to a Spark DataFrame with the createDataFrame function, it throws the following error:
# Creating a new pandas dataframe for cross-tab
recommender_pct=pd.crosstab(recommender_pdf['TRANS'], recommender_pdf['ITEM'])
Converting the pandas DataFrame back to a Spark DataFrame
In [31]:
# Creating a new Spark dataframe for cross-tab from Pandas data frame
recommender_sct=sqlContext.createDataFrame(recommender_pct)
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-31-234fe3fbf3e5> in <module>()
1 # Creating a new dataframe for cross-tab
----> 2 recommender_sct=sqlContext.createDataFrame(recommender_pct)
/Users/i854319/spark/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
423 rdd, schema = self._createFromRDD(data, schema, samplingRatio)
424 else:
--> 425 rdd, schema = self._createFromLocal(data, schema)
426 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
427 jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
/Users/i854319/spark/python/pyspark/sql/context.pyc in _createFromLocal(self, data, schema)
331 if has_pandas and isinstance(data, pandas.DataFrame):
332 if schema is None:
--> 333 schema = [str(x) for x in data.columns]
334 data = [r.tolist() for r in data.to_records(index=False)]
335
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 1: ordinal not in range(128)
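For context, the traceback points at `schema = [str(x) for x in data.columns]`: under Python 2, calling `str()` on a unicode column label that contains `u'\u201c'` (a left curly quotation mark, presumably from one of the item titles) raises exactly this error. A minimal sketch of one possible workaround, assuming it is acceptable to strip non-ASCII characters from the column labels before calling `createDataFrame` (the data here is made up for illustration):

```python
import pandas as pd

# Made-up crosstab whose column label contains curly quotes
# (u'\u201c' / u'\u201d'), mimicking the label that breaks str() under Python 2
pct = pd.crosstab(
    pd.Series([163589, 373053], name='TRANS'),
    pd.Series([u'\u201cSmart\u201d Goals', u'\u201cSmart\u201d Goals'], name='ITEM'),
)

# Re-encode the labels to plain ASCII, dropping the characters str() cannot
# handle, so a subsequent createDataFrame(pct) would not hit UnicodeEncodeError
pct.columns = [col.encode('ascii', 'ignore').decode('ascii') for col in pct.columns]
print(pct.columns.tolist())
```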
Any help with this?