Hive table loaded from a Spark DataFrame performs poorly

Time: 2017-04-21 19:40:45

Tags: apache-spark hive

I created a Hive table whose data is stored as Parquet with Snappy compression. It is an empty table. I loaded the table from Spark as shown below -

// Register the source Parquet partition as a temporary table
hiveContext.read.parquet("/user/a_dgt_intel_trans_u/xpo/impression/event_dt=2017-01-30").registerTempTable("temp")
// Derive the partition columns (year, month, dt) from event_time
hiveContext.sql("select a.*, year(event_time) as year, month(event_time) as month, day(event_time) as dt from temp a").registerTempTable("temp1")
// Dynamic-partition insert into the target Hive table
hiveContext.sql("insert into xpo_imp_ymd_hive partition(year,month,dt) select * from temp1")

When I do a select * from the table, I get this warning -

 WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr version 1.6.0
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) )?\(build ?(.*)\)

Performing a count(*) fails after a few seconds with this error message -

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
]], TaskAttempt 3 failed, info=[Container container_e82_1492612930020_0170_01_000054 finished with diagnostics set to [Container failed, exitCode=-104. Container [pid=40303,containerID=container_e82_1492612930020_0170_01_000054] is running beyond physical memory limits. Current usage: 7.9 GB of 6 GB physical memory used; 13.5 GB of 12.6 GB virtual memory used. Killing container.
Dump of the process-tree for container_e82_1492612930020_0170_01_000054 :
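
For reference, the diagnostic above says the YARN container exceeded its 6 GB physical memory allowance (7.9 GB used), so YARN killed it. A common way to experiment is to give executors more heap and more off-heap overhead when the context is created; the values below are illustrative assumptions, not taken from the original job:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hedged sketch: raise executor memory and the YARN off-heap overhead (values are guesses)
val sparkConf = new SparkConf()
  .set("spark.executor.memory", "8g")                 // executor heap size
  .set("spark.yarn.executor.memoryOverhead", "2048")  // extra off-heap allowance in MB for YARN
val sc = new SparkContext(sparkConf)
val hiveContext = new HiveContext(sc)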

Please let me know what I am missing.

0 Answers:

No answers yet