Schema error after altering a Hive table with PySpark

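I have a table in Hive called test, with columns id and name.

I also have another table in Hive called mysql, with columns id, name, and city.

Now I want to compare the schemas of the two tables and add the columns that are missing to the Hive table test.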

hive_df= sqlContext.table("testing.test")

mysql_df= sqlContext.table("testing.mysql")

hive_df.dtypes

[('id', 'int'), ('name', 'string')]

mysql_df.dtypes

[('id', 'int'), ('name', 'string'), ('city', 'string')]

hive_dtypes=hive_df.dtypes

hive_dtypes

[('id', 'int'), ('name', 'string')]


mysql_dtypes= mysql_df.dtypes

diff = set(mysql_dtypes) ^ set(hive_dtypes)

diff

set([('city', 'string')])

for col_name, col_type in diff:
...  sqlContext.sql("ALTER TABLE testing.test ADD COLUMNS ({0} {1})".format(col_name, col_type))
...

After running all of this, the Hive table test has the new column city, filled with null values as expected.

Now, when I close the Spark session, open a new one, and run
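As a side note on the comparison step, `^` is a symmetric difference, so it would also pick up columns that exist only in test, and a set loses column order. Below is a minimal sketch of a one-directional comparison by column name. The helper names `missing_columns` and `build_alter_statements` are hypothetical, the dtypes lists are copied from the output above, and a column with the same name but a different type is deliberately ignored here.

```python
def missing_columns(source_dtypes, target_dtypes):
    """Return (name, type) pairs found in source but not in target, in source order."""
    # Hive column names are case-insensitive, so compare lowercased names.
    existing = {name.lower() for name, _ in target_dtypes}
    return [(name, col_type) for name, col_type in source_dtypes
            if name.lower() not in existing]


def build_alter_statements(table, columns):
    """Build one ALTER TABLE ... ADD COLUMNS statement per missing column."""
    return ["ALTER TABLE {0} ADD COLUMNS ({1} {2})".format(table, name, col_type)
            for name, col_type in columns]


# Same values as hive_df.dtypes and mysql_df.dtypes shown above.
hive_dtypes = [('id', 'int'), ('name', 'string')]
mysql_dtypes = [('id', 'int'), ('name', 'string'), ('city', 'string')]

to_add = missing_columns(mysql_dtypes, hive_dtypes)
statements = build_alter_statements("testing.test", to_add)
for stmt in statements:
    print(stmt)  # in a real session this would be sqlContext.sql(stmt)
```

In a live session, passing `mysql_df.dtypes` and `hive_df.dtypes` directly should work the same way, since both are lists of `(name, type)` tuples.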

hive_df= sqlContext.table("testing.test")

and then

hive_df

I expect to get

DataFrame[id: int, name: string, city: string]

but I get

DataFrame[id: int, name: string]

When I run desc on the Hive table test, the new column is there:

hive> desc test;
OK
id                      int
name                    string
city                    string

Why is the schema change not reflected in the PySpark DataFrame after we alter the corresponding Hive table?

FYI, I am using Spark 1.6.

1 Answer

1
It looks like there is a Jira for this issue, https://issues.apache.org/jira/browse/SPARK-9764, which has been fixed in Spark 2.0.

For those using Spark 1.6, try creating the table with sqlContext.

That is, first register the DataFrame as a temp table and then run
```
sqlContext.sql("create table table as select * from temptable")
```
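This way, after you alter the Hive table and recreate the Spark DataFrame, df will also have the newly added column.

Solved this with the help of @zero323.
Answered on 2024-04-26T19:19:29+08:00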
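The two steps of this workaround can be wrapped in one helper, roughly as sketched below. This assumes the Spark 1.6 API (`DataFrame.registerTempTable` and `sqlContext.sql`); the function name and the table names in the usage comment are hypothetical, and the sketch has not been checked against a real cluster.

```python
def recreate_table_via_sqlcontext(sqlContext, df, temp_name, target_table):
    """Register df as a temp table, then create target_table from it with a CTAS.

    temp_name and target_table are caller-supplied names (hypothetical here).
    Returns the SQL text that was executed, which is handy for logging.
    """
    # Step 1: expose the DataFrame to SQL under a temporary name (Spark 1.6 API).
    df.registerTempTable(temp_name)
    # Step 2: let sqlContext create the table itself.
    query = "CREATE TABLE {0} AS SELECT * FROM {1}".format(target_table, temp_name)
    sqlContext.sql(query)
    return query


# Usage in a Spark 1.6 shell might look like this (names are illustrative only):
# recreate_table_via_sqlcontext(sqlContext, some_df, "temptable", "testing.test_copy")
```

The helper only depends on the two methods it calls, so it can be exercised with stand-in objects before pointing it at a real `sqlContext`.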
