Question

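I am trying to load data from S3, transform it, and then insert it into a partitioned Hive table.

At first I used creation_date (bigint) as the partition key, and that worked fine. Now, when I try to insert the same data using creation_month as the partition key, it fails.

Here is the code: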

var hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
var df = hiveCtx.read.json("s3n://spark-feedstore/2016/1/*")
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.SaveMode
hiveCtx.sql("SET hive.exec.dynamic.partition = true")
hiveCtx.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

df.persist(StorageLevel.MEMORY_AND_DISK)
df.registerTempTable("posts")

Schema of the first table:

[external_id,string,]
[tags,array<string>,]
[creation_date,bigint,]
[video_url,string,]
# Partition Information      
creation_date bigint

Schema of the second table:

[external_id,string,]
[tags,array<string>,]
[creation_date,bigint,]
[video_url,string,]
[creation_month,date,]
# Partition Information      
creation_month bigint

Inserting into the first table works fine.

var udf = hiveCtx .sql("select externalId as external_id, first(sourceMap['tags']) as tags, first(sourceMap['creation_date']) as creation_date, 
first(sourceMap['video_url']) as video_url
from posts group by externalId")

udf.write.mode(SaveMode.Append).partitionBy("creation_date").insertInto("posts_1")

But inserting into the second table produces an error.

var udf = hiveCtx .sql("select externalId as external_id, first(sourceMap['brand_hashtags']) as brand_hashtags, first(sourceMap['creation_date']) as creation_date,
first(sourceMap['video_url']) as video_url, trunc(from_unixtime(first(sourceMap['creation_date']) / 1000), 'MONTH') as creation_month from posts group by externalId")

 udf.write.mode(SaveMode.Append).partitionBy("creation_month").insertInto("posts_2")

Error:

org.apache.spark.sql.AnalysisException: cannot resolve 'cast(creation_date as array<string>)' due to data type mismatch: cannot cast LongType to ArrayType(StringType,true);

I am not sure what changes when we add another field, creation_month. Every aspect of the two tables' schemas looks exactly the same.

Answer 1

I found the problem. It was the ordering of the columns.

The field order in the table was

external_id, tag, video_url, creation_date

but in the select query I had

external_id, creation_date, tag, video_url

Because insertInto matches columns by position, not by name, Hive was trying to cast creation_date to an array.
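To make the positional pairing concrete, here is a minimal sketch in plain Scala (no Spark required). The object `PositionalInsertCheck` and its `mismatches` helper are hypothetical names introduced only for illustration; they are not part of Spark or Hive. The two column lists are the ones quoted in this answer.

```scala
// Illustrative sketch only: PositionalInsertCheck and mismatches are
// hypothetical helpers, not part of the Spark or Hive API.
object PositionalInsertCheck {
  // Pair query columns with table columns by position and return every
  // slot (index, queryColumn, tableColumn) where the names disagree.
  def mismatches(queryCols: Seq[String], tableCols: Seq[String]): Seq[(Int, String, String)] =
    queryCols.zip(tableCols).zipWithIndex.collect {
      case ((q, t), i) if q != t => (i, q, t)
    }

  def main(args: Array[String]): Unit = {
    // Column orders as listed in this answer.
    val tableCols = Seq("external_id", "tag", "video_url", "creation_date")
    val queryCols = Seq("external_id", "creation_date", "tag", "video_url")

    // Report which query column would land in which table column.
    mismatches(queryCols, tableCols).foreach { case (i, q, t) =>
      println(s"slot $i: query column '$q' lands in table column '$t'")
    }
  }
}
```

The first mismatch pairs creation_date (bigint) with the array-typed tag column, which is the cast the error message complains about. A check like this could be run against the DataFrame's `columns` and the target table's columns before calling insertInto; the simplest fix is to list the columns in the select in the same order as the table, with the partition column last.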

Spark SQL insertInto() failing for partition key
