I have a MapReduce job written in Python. It works when I run it in a local test, but it fails when I run it through Hadoop.
I am using the file I got from:
wget http://stat-computing.org/dataexpo/2009/2008.csv.bz2
The map job is:
#!/usr/bin/python
import sys

for line in sys.stdin:
    values = line.split(',')
    if values[13] != 'AirTime' and values[13] != 'NA':
        print '%s%s\t%s\t%s' % (values[8], values[9], 'flights', 1)
        print '%s%s\t%s\t%s' % (values[8], values[9], 'airTime', float(values[13]))
The reduce job is:
#!/usr/bin/python
import sys

(lastFlight, total, time) = (None, 0, 0)

for line in sys.stdin:
    (flight, key, value) = line.split('\t')
    if lastFlight and flight !=lastFlight:
        if total > 0:
            print '%s\t%f' % (lastFlight, time/total)
        lastFlight = flight
        if key == 'flights':
            (flight, total, time) = (value, float(value), 0)
        elif key == 'airTime':
            (flight, total, time) = (value, 0, float(value))
    else:
        lastFlight = flight
        (total, time) = (total + float(value), time + float(value))

if lastFlight:
    if total > 0:
        print '%s\t%f' % (lastFlight, time/total)
The test command:
head *.csv | ./map.py | sort | ./reduce.py >out.log 2>&1
I can see that the output is generated without errors.
The Hadoop command:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.10.0-1.jar \
    -input /user/flight/*.csv \
    -output /user/flight/result1 -file map.py -file reduce.py \
    -mapper map.py -mapper map.py -combiner reduce.py -reducer reduce.py
The map phase works, but then I get an error. The error message is not very specific:
16/11/26 11:17:10 INFO mapreduce.Job: map 100% reduce 28%
16/11/26 11:17:11 INFO mapreduce.Job:
Task Id : attempt_1480024909550_0014_r_000000_0,
Status : FAILED
Error: java.lang.RuntimeException:
PipeMapRed.waitOutputThreads(): subprocess failed with code 1
If I look at the job logs, I get:
line 7, in <module>
(flight, key, value) = line.split('\t')
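The traceback above does not show the exception type, so the following is only a diagnostic sketch. It assumes the failure is an unpacking error caused by a line that does not have exactly three tab-separated fields. The helper name parse_record and the sample values are hypothetical and not part of my scripts. It reports such lines on stderr, so they would appear in the task logs, instead of raising. Note that reduce.py itself prints two-field lines, and the local test never feeds reduce.py output back into reduce.py, whereas the Hadoop command also uses it as the combiner.

```python
import sys

def parse_record(line):
    """Split one reducer input line into (flight, key, value), or return None.

    Hypothetical helper: it only checks the field count and logs bad lines.
    """
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 3:
        # Write the unexpected line to stderr so it shows up in the task logs.
        sys.stderr.write('unexpected field count %d: %r\n' % (len(fields), line))
        return None
    return tuple(fields)

# A three-field line in the shape map.py emits (sample values are made up).
mapper_line = 'WN335\tflights\t1\n'
# A two-field line in the shape reduce.py prints (sample values are made up).
reducer_line = 'WN335\t128.000000\n'

print(parse_record(mapper_line))
print(parse_record(reducer_line))
```

Calling parse_record inside the reducer loop and skipping None results would at least show which input lines trigger the failure.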
Any ideas why the reduce part fails?
Thanks.