I want to run a simple Spark script that contains a few Spark SQL queries, basically HiveQL. The corresponding tables are saved in the spark-warehouse folder.
from pyspark.sql import SparkSession
from pyspark.sql import Row
spark=SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/tmp").appName("TestApp").enableHiveSupport().getOrCreate()
sqlstring="SELECT lflow1.LeaseType as LeaseType, lflow1.Status as Status, lflow1.Property as property, lflow1.City as City, lesflow2.DealType as DealType, lesflow2.Area as Area, lflow1.Did as DID, lesflow2.MID as MID from lflow1, lesflow2 WHERE lflow1.Did = lesflow2.MID"
def queryBuilder(sqlval):
    df = spark.sql(sqlval)
    df.show()
    return df  # return the DataFrame so the caller can use it

result = queryBuilder(sqlstring)
print(result.collect())
print("Type of", type(result))
After running spark-submit, I am facing the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o27.sql. : org.apache.spark.sql.AnalysisException: Table or view not found: lflow1; line 1 pos 211
I cannot figure out why it happens. I have seen some posts on Stack Overflow suggesting that I have to enable Hive support by calling enableHiveSupport(), which I have already done in my script, but I still get this error. I am running PySpark 2.2 on Windows 10. Please help me figure it out.
I have created lflow1 and lesflow2 from DataFrames and saved them as permanent tables in the PySpark shell. Here is my code:
df = spark.read.json("C:/Users/codemen/Desktop/test for sparkreport engine/LeaseFlow1.json")
df2 = spark.read.json("C:/Users/codemen/Desktop/test for sparkreport engine/LeaseFlow2.json")
df.write.saveAsTable("lflow1")
df2.write.saveAsTable("lesflow2")
In the PySpark shell I have executed the query
spark.sql("SELECT lflow1.LeaseType as LeaseType, lflow1.Status as Status, lflow1.Property as property, lflow1.City as City, lesflow2.DealType as DealType, lesflow2.Area as Area, lflow1.Did as DID, lesflow2.MID as MID from lflow1, lesflow2 WHERE lflow1.Did = lesflow2.MID").show()
and the PySpark console shows the result.