Hive Testbench data generation fails

Date: 2018-01-02 14:56:42

Tags: hadoop hive yarn benchmarking tez

I cloned Hive Testbench to try to run the Hive benchmarks on a Hadoop cluster built from the Apache binary distributions of Hadoop 2.9.0, Hive 2.3.0, and Tez 0.9.0.

I managed to build both data generators: TPC-H and TPC-DS. However, the next step, data generation, failed for both TPC-H and TPC-DS. The failure is very consistent: every run fails at exactly the same step with the same error message.

For TPC-H, the data-generation screen output is:

$ ./tpch-setup.sh 10
ls: `/tmp/tpch-generate/10/lineitem': No such file or directory
Generating data at scale factor 10.
...
18/01/02 14:43:00 INFO mapreduce.Job: Running job: job_1514226810133_0050
18/01/02 14:43:01 INFO mapreduce.Job: Job job_1514226810133_0050 running in uber mode : false
18/01/02 14:43:01 INFO mapreduce.Job:  map 0% reduce 0%
18/01/02 14:44:38 INFO mapreduce.Job:  map 10% reduce 0%
18/01/02 14:44:39 INFO mapreduce.Job:  map 20% reduce 0%
18/01/02 14:44:46 INFO mapreduce.Job:  map 30% reduce 0%
18/01/02 14:44:48 INFO mapreduce.Job:  map 40% reduce 0%
18/01/02 14:44:58 INFO mapreduce.Job:  map 70% reduce 0%
18/01/02 14:45:14 INFO mapreduce.Job:  map 80% reduce 0%
18/01/02 14:45:15 INFO mapreduce.Job:  map 90% reduce 0%
18/01/02 14:45:23 INFO mapreduce.Job:  map 100% reduce 0%
18/01/02 14:45:23 INFO mapreduce.Job: Job job_1514226810133_0050 completed successfully
18/01/02 14:45:23 INFO mapreduce.Job: Counters: 0
SLF4J: Class path contains multiple SLF4J bindings.
...
ls: `/tmp/tpch-generate/10/lineitem': No such file or directory
Data generation failed, exiting.

For TPC-DS, the error output is:

$ ./tpcds-setup.sh 10
...
18/01/02 22:13:58 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
18/01/02 22:13:58 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:13:59 INFO input.FileInputFormat: Total input files to process : 1
18/01/02 22:13:59 INFO mapreduce.JobSubmitter: number of splits:10
18/01/02 22:13:59 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
18/01/02 22:13:59 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/01/02 22:13:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1514226810133_0082
18/01/02 22:14:00 INFO client.YARNRunner: Number of stages: 1
18/01/02 22:14:00 INFO Configuration.deprecation: mapred.job.map.memory.mb is deprecated. Instead, use mapreduce.map.memory.mb
18/01/02 22:14:00 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.9.0, revision=0873a0118a895ca84cbdd221d8ef56fedc4b43d0, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2017-07-18T05:41:23Z ]
18/01/02 22:14:00 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:14:00 INFO client.TezClient: Submitting DAG application with id: application_1514226810133_0082
18/01/02 22:14:00 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs://192.168.10.15:8020/apps/tez,hdfs://192.168.10.15:8020/apps/tez/lib/
18/01/02 22:14:00 INFO client.TezClientUtils: Using tez.lib.uris.classpath value from configuration: null
18/01/02 22:14:00 INFO client.TezClient: Tez system stage directory hdfs://192.168.10.15:8020/tmp/hadoop-yarn/staging/rapids/.staging/job_1514226810133_0082/.tez/application_1514226810133_0082 doesn't exist and is created
18/01/02 22:14:01 INFO client.TezClient: Submitting DAG to YARN, applicationId=application_1514226810133_0082, dagName=GenTable+all_10
18/01/02 22:14:01 INFO impl.YarnClientImpl: Submitted application application_1514226810133_0082
18/01/02 22:14:01 INFO client.TezClient: The url to track the Tez AM: http://boray05:8088/proxy/application_1514226810133_0082/
18/01/02 22:14:05 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:14:05 INFO mapreduce.Job: The url to track the job: http://boray05:8088/proxy/application_1514226810133_0082/
18/01/02 22:14:05 INFO mapreduce.Job: Running job: job_1514226810133_0082
18/01/02 22:14:06 INFO mapreduce.Job: Job job_1514226810133_0082 running in uber mode : false
18/01/02 22:14:06 INFO mapreduce.Job:  map 0% reduce 0%
18/01/02 22:15:51 INFO mapreduce.Job:  map 10% reduce 0%
18/01/02 22:15:54 INFO mapreduce.Job:  map 20% reduce 0%
18/01/02 22:15:55 INFO mapreduce.Job:  map 40% reduce 0%
18/01/02 22:15:56 INFO mapreduce.Job:  map 50% reduce 0%
18/01/02 22:16:07 INFO mapreduce.Job:  map 60% reduce 0%
18/01/02 22:16:09 INFO mapreduce.Job:  map 70% reduce 0%
18/01/02 22:16:11 INFO mapreduce.Job:  map 80% reduce 0%
18/01/02 22:16:19 INFO mapreduce.Job:  map 90% reduce 0%
18/01/02 22:19:54 INFO mapreduce.Job:  map 100% reduce 0%
18/01/02 22:19:54 INFO mapreduce.Job: Job job_1514226810133_0082 completed successfully
18/01/02 22:19:54 INFO mapreduce.Job: Counters: 0
...
TPC-DS text data generation complete.
Loading text data into external tables.
Optimizing table time_dim (2/24).
Optimizing table date_dim (1/24).
Optimizing table item (3/24).
Optimizing table customer (4/24).
Optimizing table household_demographics (6/24).
Optimizing table customer_demographics (5/24).
Optimizing table customer_address (7/24).
Optimizing table store (8/24).
Optimizing table promotion (9/24).
Optimizing table warehouse (10/24).
Optimizing table ship_mode (11/24).
Optimizing table reason (12/24).
Optimizing table income_band (13/24).
Optimizing table call_center (14/24).
Optimizing table web_page (15/24).
Optimizing table catalog_page (16/24).
Optimizing table web_site (17/24).
make: *** [store_sales] Error 2
make: *** Waiting for unfinished jobs....
make: *** [store_returns] Error 2
Data loaded into database tpcds_bin_partitioned_orc_10.

I noticed that both during the job run and after the failure, the target temporary HDFS directory stays empty, apart from the generated subdirectory.

At this point I can't even tell whether the failure is caused by a Hadoop configuration problem, a software version mismatch, or something else. Any help?
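Since the job reports `Counters: 0` and the temporary directory stays empty, one thing worth checking from a cluster node is whether the submitting user can actually write to the target HDFS path. A diagnostic sketch (the path is the tpch-setup.sh default; substitute your own if you changed it):

```shell
# Diagnostic sketch: inspect ownership/permissions of the generation
# directory and list whatever the mappers actually wrote.
GEN_DIR=/tmp/tpch-generate    # default used by tpch-setup.sh
if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -ls -d "$GEN_DIR"      # who owns it, what are the permissions?
    hdfs dfs -ls -R "$GEN_DIR/10"   # did any table data land here?
else
    echo "hdfs not on PATH; run this on a cluster node"
fi
```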

1 Answer:

Answer 0 (score: 0)

I had a similar problem running this job. When I pointed the script at an HDFS location I had write permission to, it succeeded.

./tpcds-setup.sh 10 <hdfs_directory_path>

I would still get this error at the start of the script:

Data loaded into database tpcds_bin_partitioned_orc_10.
ls: `<hdfs_directory_path>/10': No such file or directory

However, the script ran successfully, generated the data, and loaded it into the Hive tables at the end.

Hope that helps.
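For reference, a minimal wrapper along these lines. It is a sketch under two assumptions: your HDFS home directory `/user/$(whoami)` exists and is writable, and you run it from the hive-testbench checkout. To stay side-effect free it only prints the command it would run; uncomment the lines to execute them.

```shell
# Sketch: run the TPC-DS setup against an HDFS directory the current user
# can write to, instead of the default /tmp location.
SCALE=10
GEN_DIR="/user/$(whoami)/tpcds-generate"   # hypothetical path; any writable dir works

# Create the directory up front so a permission problem surfaces immediately:
#   hdfs dfs -mkdir -p "$GEN_DIR"
# Then pass the directory as the second argument, as above:
#   ./tpcds-setup.sh "$SCALE" "$GEN_DIR"
echo "./tpcds-setup.sh $SCALE $GEN_DIR"
```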