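I have 2 scripts, a mapper and a reducer. Both read their input through a csv reader. The mapper script should take its input from the tab-delimited text file dataset.csv, and the reducer's input should be the mapper's output. I want to save the reducer's output to a text file, output.txt. What is the correct chain of commands to do this?
Mapper: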
#!/usr/bin/python
import sys, csv

reader = csv.reader(sys.stdin, delimiter='\t')
writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

for line in reader:
    if len(line) > 5:  # parse only lines in the forum_node.tsv file
        if line[5] == 'question':
            _id = line[0]  # a question is keyed by its own id
            student = line[3]  # author_id
        elif line[5] != 'node_type':
            _id = line[7]  # answers and comments are keyed by the id in column 7
            student = line[3]  # author_id
        else:
            continue  # ignore header
        print('{0}\t{1}'.format(_id, student))
Reducer:
#!/usr/bin/python
import sys, csv

reader = csv.reader(sys.stdin, delimiter='\t')
writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

oldID = None
students = []

for line in reader:
    if len(line) != 2:
        continue  # skip malformed lines
    thisID, thisStudent = line
    if oldID and oldID != thisID:
        # the key changed: emit the finished thread and start a new list
        print('Thread: {0}, students: {1}'.format(oldID, ', '.join(students)))
        students = []
    oldID = thisID
    students.append(thisStudent)

# emit the last thread, if any input was seen
if oldID is not None:
    print('Thread: {0}, students: {1}'.format(oldID, ', '.join(students)))
Answer 0 (score: 3)
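Pipe the commands together:
python mapper.py < dataset.csv | python reducer.py > output.txt
Here < dataset.csv feeds the CSV file to mapper.py on stdin, | redirects the mapper's stdout to the stdin of another command (in this case python reducer.py), and > output.txt connects that script's stdout to the file output.txt.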
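One thing to keep in mind when running the pipeline locally: the reducer only groups consecutive lines that share the same ID, so it relies on the mapper output arriving sorted by key. If the scripts are meant for a MapReduce framework that sorts between the two phases, a plain pipe will not do that for you. The following is a minimal in-process sketch of the same map, sort, reduce flow. The helper names (make_row, map_rows, reduce_pairs) and the sample rows are hypothetical and exist only for illustration, and the length guard is slightly stricter than in the original mapper.

```python
from itertools import groupby
from operator import itemgetter


def make_row(node_id, author_id, node_type, abs_parent_id):
    # Hypothetical helper: only columns 0, 3, 5 and 7 matter to the mapper,
    # so the remaining columns are left as empty placeholders.
    return [node_id, '', '', author_id, '', node_type, '', abs_parent_id]


def map_rows(rows):
    """Yield (thread_id, author_id) pairs, loosely mirroring the mapper."""
    for row in rows:
        if len(row) <= 7:
            continue  # stricter than the original: column 7 must exist
        node_type = row[5]
        if node_type == 'node_type':
            continue  # header row
        thread_id = row[0] if node_type == 'question' else row[7]
        yield thread_id, row[3]


def reduce_pairs(pairs):
    """Sort pairs by thread ID so equal IDs are adjacent, then group them."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable sort keeps input order per key
    for thread_id, group in groupby(ordered, key=itemgetter(0)):
        students = [student for _, student in group]
        yield 'Thread: {0}, students: {1}'.format(thread_id, ', '.join(students))


# Hypothetical sample data with interleaved threads (not from dataset.csv).
SAMPLE_ROWS = [
    ['id', 'title', 'tagnames', 'author_id', 'body',
     'node_type', 'parent_id', 'abs_parent_id'],
    make_row('10', '100', 'question', '\\N'),
    make_row('20', '200', 'question', '\\N'),
    make_row('11', '101', 'answer', '10'),
    make_row('21', '201', 'comment', '20'),
    make_row('12', '102', 'comment', '10'),
]

result = list(reduce_pairs(map_rows(SAMPLE_ROWS)))
for line in result:
    print(line)
```

Without the sorted() call, the interleaved sample rows would produce several partial groups for the same thread, which is what the reducer script would also do on unsorted input.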