I recently started learning MapReduce. The code I copied is:
import os
import re
from mrjob.job import MRJob
from mrjob.step import MRStep

word_search_re = re.compile(r"[\w']+")

class ExtractPosts(MRJob):
    post_start = False
    post = []

    def mapper(self, key, line):
        filename = os.environ["map_input_file"]
        gender = filename.split(".")[1]
        try:
            docnum = int(filename[0])
        except:
            docnum = 8
        if filename.startswith("51"):
            # remove leading and trailing whitespace
            line = line.strip()
            if line == "<post>":
                self.post_start = True
            elif line == "</post>":
                self.post_start = False
                yield gender, repr("\n".join(self.post))
                self.post = []
            elif self.post_start:
                self.post.append(line)
Then I ran it from the command line as administrator:
python extract_posts.py f:/blogs/51* --output-dir=f:/blogposts
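I searched online, but the answers I found don't seem to fit my problem, and I don't know what to do. My output directory does contain some files, such as part-00000.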
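For anyone who wants to check what ended up in those part files, here is a minimal sketch for reading them back. It assumes the job used mrjob's default output protocol, which writes one record per line as a JSON-encoded key, a tab, and a JSON-encoded value. `read_mrjob_output` is a hypothetical helper name chosen for illustration, not part of mrjob.

```python
import glob
import json
import os


def read_mrjob_output(output_dir):
    """Yield (key, value) pairs from every part-* file in output_dir.

    Assumption: each line is "<JSON key>\t<JSON value>", which is what
    mrjob's default JSON output protocol writes.
    """
    pattern = os.path.join(output_dir, "part-*")
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue  # skip blank lines
                # JSON escapes tabs inside strings, so the first tab
                # is always the key/value separator.
                raw_key, raw_value = line.split("\t", 1)
                yield json.loads(raw_key), json.loads(raw_value)
```

With the command above, iterating over `read_mrjob_output("f:/blogposts")` should give pairs of the gender key and the `repr(...)` string that the mapper yielded for each post. If a part file exists but is empty, the loop simply produces nothing.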