Question

I have a Hive table named test with the columns id and name.

I also have another Hive table named mysql with the columns id, name, and city.

I want to compare the schemas of the two tables and add the columns that differ to the Hive table test.

hive_df= sqlContext.table("testing.test")

mysql_df= sqlContext.table("testing.mysql")

hive_df.dtypes

[('id', 'int'), ('name', 'string')]

mysql_df.dtypes

[('id', 'int'), ('name', 'string'), ('city', 'string')]

hive_dtypes=hive_df.dtypes

hive_dtypes

[('id', 'int'), ('name', 'string')]


mysql_dtypes= mysql_df.dtypes

diff = set(mysql_dtypes) ^ set(hive_dtypes)

diff

set([('city', 'string')])

for col_name, col_type in diff:
...  sqlContext.sql("ALTER TABLE testing.test ADD COLUMNS ({0} {1})".format(col_name, col_type))
...
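
As a side note on the comparison step above: `^` is a symmetric difference, so it would also pick up columns that exist only in test, or columns whose types differ between the two tables. The following is a minimal sketch of a one-directional comparison by column name, written in plain Python so it can be tried without a Spark session. The helper names `missing_columns` and `build_alter_statement` are hypothetical, and matching names case-insensitively while ignoring type mismatches is an assumption of this sketch, not something the original code does.

```python
def missing_columns(source_dtypes, target_dtypes):
    """Return (name, type) pairs that are in source but not in target.

    Columns are matched by name only (case-insensitively), and the
    order of the source list is preserved. Type mismatches are ignored.
    """
    target_names = {name.lower() for name, _ in target_dtypes}
    return [(name, col_type) for name, col_type in source_dtypes
            if name.lower() not in target_names]


def build_alter_statement(table, columns):
    """Build one ALTER TABLE ... ADD COLUMNS statement, or None if empty."""
    if not columns:
        return None
    column_list = ", ".join("{0} {1}".format(name, col_type)
                            for name, col_type in columns)
    return "ALTER TABLE {0} ADD COLUMNS ({1})".format(table, column_list)


# Same shape as the output of DataFrame.dtypes shown above.
hive_dtypes = [('id', 'int'), ('name', 'string')]
mysql_dtypes = [('id', 'int'), ('name', 'string'), ('city', 'string')]

to_add = missing_columns(mysql_dtypes, hive_dtypes)
statement = build_alter_statement("testing.test", to_add)
print(statement)
```

The resulting string could then be passed to `sqlContext.sql(...)` in the same way as in the loop above, adding all missing columns in a single statement.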

After doing all of this, the Hive table test has the new column city, filled with null values as expected.

Now, when I close the Spark session and open a new one, I run

hive_df= sqlContext.table("testing.test")

and then

hive_df

I expect to get

DataFrame[id: int, name: string, city: string]

but I get this instead

DataFrame[id: int, name: string]

When I describe the Hive table test, I get

hive> desc test;
OK
id                      int
name                    string
city                    string

Why is the schema change not reflected in the PySpark DataFrame after altering the corresponding Hive table?

FYI, I am using Spark 1.6.

Answer 1

There appears to be a Jira for this issue, https://issues.apache.org/jira/browse/SPARK-9764, which has been fixed in Spark 2.0.

For those using Spark 1.6, try creating the table with sqlContext.

That is, first register the data frame as a temp table, and then run

sqlContext.sql("create table table as select * from temptable")

This way, after you alter the Hive table and recreate the Spark data frame, the df will also include the newly added columns.
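
The two steps of this workaround can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `create_table_through_sql_context` and the table names in the usage note are hypothetical, and `registerTempTable` is the Spark 1.6 DataFrame method (Spark 2.0 and later use `createOrReplaceTempView` instead).

```python
def create_table_through_sql_context(sql_context, df, temp_name, table_name):
    """Create a Hive table from a DataFrame by going through sqlContext.

    Hypothetical helper: registers df as a temp table, then issues a
    CREATE TABLE ... AS SELECT statement via sql_context.sql().
    """
    # Step 1: register the data frame as a temp table (Spark 1.6 API).
    df.registerTempTable(temp_name)
    # Step 2: create the real table from the temp table with Spark SQL.
    statement = "CREATE TABLE {0} AS SELECT * FROM {1}".format(
        table_name, temp_name)
    sql_context.sql(statement)
    return statement
```

In a PySpark 1.6 shell this might be called as `create_table_through_sql_context(sqlContext, mysql_df, "temptable", "testing.test_new")`, where `testing.test_new` is a made-up table name.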

Solved this with the help of @zero323.

