Question

I'm facing a strange problem that I can't make sense of.

The Impressions column in my source data is sometimes bigint and sometimes string (as I can see when I browse the data manually).

The Hive schema registered for this column is Long.

So, when loading the data:

spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW adwords_ads_agg_Yxz AS

SELECT
    a.customer_id
    , a.Campaign
    , ...
    , SUM(BIGINT(a.Impressions)) as Impressions
    , SUM(BIGINT(a.Cost))/1000000 as Cost
FROM adwords_ad a
LEFT JOIN ds_ad_mapping m ON BIGINT(a.Ad_ID) = BIGINT(m.adEngineId) AND a.customer_id = m.reportAccountId
WHERE a.customer_id in (...)
AND a.day >= DATE('2019-02-01')
GROUP BY
    a.customer_id
    , ...
""")

I make sure everything is cast to BIGINT. The error occurs at a later step:

spark.sql("CACHE TABLE adwords_ads_agg_Yxz")

After seeing this error, I ran the same code in a notebook and tried to debug further. First, I made sure the column was cast to BIGINT / long:

from pyspark.sql import functions as f
from pyspark.sql.types import LongType

df = df.withColumn("Impressions", f.col("Impressions").cast(LongType()))
df.createOrReplaceTempView('adwords_ads_agg_Yxz')

and then printed the schema of this newly cast df:

root
 |-- customer_id: long (nullable = true)
 |-- Campaign: string (nullable = true)
 |-- MatchType: string (nullable = true)
 |-- League: string (nullable = false)
 |-- Ad_Group: string (nullable = true)
 |-- Impressions: long (nullable = true) <- Here!
 |-- Cost: double (nullable = true)

and then cached it, but the error persists:

An error occurred while calling o84.sql. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 47.0 failed 4 times, most recent failure: Lost task 9.3 in stage 47.0 (TID 2256, ip-172-31-00-00.eu-west-1.compute.internal, executor 10): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3a://bucket/prod/reports/adwords_ad/customer_id=1111111/date=2019-11-21/theparquetfile.snappy.parquet. Column: [Impressions], Expected: bigint, Found: BINARY

Has anyone run into this problem and/or knows what causes it?

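If I remove the caching, the error occurs when I try to write the data to Parquet instead. I also don't understand why the adwords_ad table is mentioned when I am trying to refresh/write a temporary table.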

Answer 1

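When you have a Hive table on top of Parquet files and read it with Spark, Spark takes the schema from the Parquet files, not from the Hive table definition.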
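One way to check this is to read each partition path on its own and compare the type Spark reports for the column. The following is only a minimal sketch: the helper names `column_types_by_path` and `mismatched_paths` are hypothetical, `spark` is assumed to be an existing SparkSession, and it assumes that all files inside a single path agree on the column type (Spark does not merge Parquet schemas by default, so a mixed directory could still hide a mismatch).

```python
def column_types_by_path(spark, paths, column):
    """Return {path: Spark type string} for `column`, reading each path on its own."""
    types = {}
    for path in paths:
        # Reading one partition directory (or one file) at a time means the
        # schema comes only from the Parquet footers under that path.
        schema = spark.read.parquet(path).schema
        by_name = {field.name: field.dataType.simpleString() for field in schema}
        types[path] = by_name.get(column)  # None if the column is absent
    return types


def mismatched_paths(types_by_path, expected):
    """Return the sorted paths whose column type differs from `expected`."""
    return sorted(p for p, t in types_by_path.items() if t != expected)
```

Called with the partition directories of adwords_ad, the column name "Impressions", and the expected type "bigint", `mismatched_paths` would list the partitions that were written with a different type (for example "string" or "binary").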

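So the error makes sense: in your Parquet file the Impressions column is stored as BINARY, and it does not matter that the Hive table declares it as Long, because Spark picks up the schema from the Parquet file. This is also why the error only shows up at CACHE TABLE or on write, and why it mentions adwords_ad: the temporary view is evaluated lazily, so the underlying adwords_ad files are only scanned at that point.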
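A possible workaround, sketched here under assumptions rather than taken from the original answer, is to bypass the Hive table, read each partition path separately, normalize the column, and union the results. The function name `read_with_cast` and its parameters are hypothetical; `spark` is assumed to be an existing SparkSession, and `base_path` is assumed to be the table root so that partition columns such as customer_id are preserved. The column is cast to string first so that the same code handles paths where it is read as binary, string, or long; values that are not numeric become null.

```python
from functools import reduce


def read_with_cast(spark, paths, base_path, column="Impressions", target="bigint"):
    """Read each path on its own, cast `column` to `target`, and union the frames."""
    if not paths:
        raise ValueError("paths must not be empty")
    frames = []
    for path in paths:
        # basePath keeps the partition columns that are encoded in the directory names.
        df = spark.read.option("basePath", base_path).parquet(path)
        # Going through string makes the cast work whether the file stores
        # the column as binary, string, or long.
        normalized = df[column].cast("string").cast(target)
        frames.append(df.withColumn(column, normalized))
    # unionByName matches columns by name, so column order may differ between paths.
    return reduce(lambda left, right: left.unionByName(right), frames)
```

The resulting DataFrame could then be registered with `createOrReplaceTempView` in place of the Hive table. This only helps if every file under a given path uses the same physical type; rewriting the offending partitions with a consistent type is the more durable fix.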

Spark 2.4: Parquet column cannot be converted in file, Column: [Impressions], Expected: bigint, Found: BINARY
