Py4JJavaError: trying to run an example with "spark" using PySpark in a Jupyter notebook.

I am trying to run one of the PySpark examples (e.g. spark/examples/src/main/python/ml/fpgrowth_example.py) in a Jupyter notebook. However, I get an exception whenever I try to call anything of the form "spark.(some function)" — in this example it is spark.createDataFrame, but I have also tried spark.read, which raises the same exception. I have tried both creating my own SparkSession and using the one that is already available in the Jupyter notebook at startup, and neither works. The closest duplicate I could find is AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'

Code:

# $example on$
from pyspark.ml.fpm import FPGrowth 
# $example off$
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("FPGrowthExample")\
        .getOrCreate()

    # $example on$
    df = spark.createDataFrame([
        (0, [1, 2, 5]),
        (1, [1, 2, 3, 5]),
        (2, [1, 2])
    ], ["id", "items"])
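For reference, the upstream fpgrowth_example.py goes on to fit an FP-Growth model on this DataFrame; those steps are never reached here because createDataFrame itself already fails. A sketch of the remainder, following the standard pyspark.ml.fpm API:

```python
# Continuation of the example above (assumes `df` was created successfully).
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)

# Frequent itemsets and association rules mined from the transactions.
model.freqItemsets.show()
model.associationRules.show()

# transform() adds a prediction column, applying the mined rules
# to each row's items.
model.transform(df).show()
```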

Exception:

AnalysisException                         Traceback (most recent call last)
<ipython-input-2-512249e97d93> in <module>()
      3     (1, [1, 2, 3, 5]),
      4     (2, [1, 2])
----> 5 ], ["id", "items"])

D:\spark\spark-2.3.1-bin-hadoop2.7\python\pyspark\sql\session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    691             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    692         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
--> 693         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
    694         df = DataFrame(jdf, self._wrapped)
    695         df._schema = schema

D:\spark\spark-2.3.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258
   1259         for temp_arg in temp_args:

D:\spark\spark-2.3.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'


Py4JJavaError                             Traceback (most recent call last)
D:\spark\spark-2.3.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

D:\spark\spark-2.3.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329                 else:

Py4JJavaError: An error occurred while calling o24.applySchemaToPythonRDD.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
	at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.<init>(HiveSessionStateBuilder.scala:69)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
	at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79)
	at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
	at org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scala:577)
	at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:752)
	at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:737)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
	at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
	at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:114)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
	at java.lang.reflect.Constructor.newInstance(Unknown Source)
	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:385)
	at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287)
	at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
	at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
	... 30 more
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
	at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
	... 45 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
	at java.lang.reflect.Constructor.newInstance(Unknown Source)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
	... 51 more

Answer (1)

2 years ago

With Spark version 2.3.1, I was able to create DataFrames such as:

df = spSession.createDataFrame(someRDD)

by removing the .enableHiveSupport() call around line 45 of the file \spark\python\pyspark\shell.py:

SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
    spark = SparkSession.builder\
        .enableHiveSupport()\    <--- delete this line
        .getOrCreate()

I cannot explain it any further, but I suppose it is because I do not have Hive installed on my Windows 10 machine; deleting this line keeps PySpark from using Hive and lets it fall back to whatever else works for creating the DataFrame. I hope this helps.
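As an alternative to editing shell.py, the same effect can be achieved from inside the notebook by building a session that bypasses the Hive catalog. This is a sketch, not part of the original answer; `spark.sql.catalogImplementation` is Spark's switch between the Hive and in-memory catalogs:

```python
from pyspark.sql import SparkSession

# Use Spark's in-memory catalog instead of Hive, so no
# SessionHiveMetaStoreClient is ever instantiated.
spark = SparkSession.builder \
    .appName("FPGrowthExample") \
    .config("spark.sql.catalogImplementation", "in-memory") \
    .getOrCreate()

df = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 2]),
], ["id", "items"])
```

With the in-memory catalog, spark.createDataFrame should no longer trigger the Hive metastore error on machines without a working Hive installation.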