I installed Cloudera Manager (CDH 5) and set up my own cluster. Everything works, but when I run my job it is very slow (18 minutes), while a Ruby script does the same work in about 5 seconds.
My job consists of:
#mapper.py
import sys

def do_map(doc):
    for word in doc.split():
        yield word.lower(), 1

for line in sys.stdin:
    for key, value in do_map(line):
        print(key + "\t" + str(value))
and
#reducer.py
import sys

def do_reduce(word, values):
    return word, sum(values)

prev_key = None
values = []

for line in sys.stdin:
    key, value = line.split("\t")
    if key != prev_key and prev_key is not None:
        result_key, result_value = do_reduce(prev_key, values)
        print(result_key + "\t" + str(result_value))
        values = []
    prev_key = key
    values.append(int(value))

if prev_key is not None:
    result_key, result_value = do_reduce(prev_key, values)
    print(result_key + "\t" + str(result_value))
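For a quick local sanity check outside Hadoop, the streaming shuffle can be imitated with a shell sort (this assumes a local copy of the lenta_articles directory):

cat lenta_articles/* | python mapper.py | sort | python reducer.py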
This is the command I use to run my job:
yarn jar hadoop-streaming.jar -input lenta_articles -output lenta_wordcount -file mapper.py -file reducer.py -mapper "python mapper.py" -reducer "python reducer.py"
Log of the run:
15/11/17 10:14:27 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper.py, reducer.py] [/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/hadoop-streaming-2.6.0-cdh5.4.8.jar] /tmp/streamjob8334226755199432389.jar tmpDir=null
15/11/17 10:14:29 INFO client.RMProxy: Connecting to ResourceManager at manager/10.128.181.136:8032
15/11/17 10:14:29 INFO client.RMProxy: Connecting to ResourceManager at manager/10.128.181.136:8032
15/11/17 10:14:31 INFO mapred.FileInputFormat: Total input paths to process : 909
15/11/17 10:14:32 INFO mapreduce.JobSubmitter: number of splits:909
15/11/17 10:14:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1447762910705_0010
15/11/17 10:14:32 INFO impl.YarnClientImpl: Submitted application application_1447762910705_0010
15/11/17 10:14:32 INFO mapreduce.Job: The url to track the job: http://manager:8088/proxy/application_1447762910705_0010/
15/11/17 10:14:32 INFO mapreduce.Job: Running job: job_1447762910705_0010
15/11/17 10:14:49 INFO mapreduce.Job: Job job_1447762910705_0010 running in uber mode : false
15/11/17 10:14:49 INFO mapreduce.Job: map 0% reduce 0%
15/11/17 10:16:04 INFO mapreduce.Job: map 1% reduce 0%
The lenta_articles folder is 2.5 MB in size and consists of 909 files; the average file size is 3 KB.
If you need any more details or want me to run any command, just ask.
What am I doing wrong?
Answer 0 (score: 0)
Hadoop is not efficient at processing a large number of small files, but it is very efficient with a small number of large files.
Since you are already using Cloudera, have a look at the alternative ways of improving performance with lots of small files in Hadoop, as described in the Cloudera article.
The main reasons for the slow processing:
Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
The more files you have, the more mappers are needed to read and process the data: thousands of mappers processing tiny files and passing their output to the reducers over the network degrade performance. In your case the job schedules 909 map tasks, one per file (see "number of splits:909" in your log); at even a second or two of scheduling and JVM startup per task, the overhead alone accounts for most of the 18 minutes.
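A quick way to confirm this on your cluster is the "batch file consolidation" approach from the list below: merge the 909 small files into one HDFS file and rerun the job against it. A minimal sketch, assuming the hadoop client is on your path and lenta_articles sits in your HDFS home directory (the merged file name is illustrative):

hadoop fs -cat 'lenta_articles/*' | hadoop fs -put - lenta_articles_merged.txt

Rerunning the streaming job with -input lenta_articles_merged.txt should then produce a single split, and the run time should drop from minutes to seconds if small files are indeed the bottleneck.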
Passing the input as sequence files with LZO compression is one of the best alternatives for handling a large number of small files. Have a look at SE Question 1 and Other Alternative.
There are some other options as well (some of them unrelated to Python), but you should look at the article:
Change the ingestion process/interval
Batch file consolidation
Sequence files
HBase
S3DistCp (If using Amazon EMR)
Using a CombineFileInputFormat (see the sketch after this list)
Hive configuration settings
Using Hadoop’s append capabilities
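For the CombineFileInputFormat option, Hadoop streaming can be pointed at a combining input format straight from the command line. A sketch, under the assumption that your CDH 5.4 build ships the old-API org.apache.hadoop.mapred.lib.CombineTextInputFormat (streaming uses the mapred API) and honors mapreduce.input.fileinputformat.split.maxsize as the maximum combined split size:

yarn jar hadoop-streaming.jar \
    -D mapreduce.input.fileinputformat.split.maxsize=134217728 \
    -files mapper.py,reducer.py \
    -inputformat org.apache.hadoop.mapred.lib.CombineTextInputFormat \
    -input lenta_articles -output lenta_wordcount_combined \
    -mapper "python mapper.py" -reducer "python reducer.py"

This packs many small files into each split (here up to 128 MB), so your 2.5 MB of input becomes a single map task; it also uses -files instead of the deprecated -file flag your log warns about. The output directory name is new because the job fails if lenta_wordcount already exists.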