I'm trying to get better at big data programming, but I know almost nothing about Python. So I'm using the MapReduce paradigm, in Python, on a number of text files stored in a directory called mydir, and my data source is:
import glob

global_file = glob.glob("mydir/*")

def file_contents(file_name):
    f = open(file_name)
    try:
        return f.read()
    finally:
        f.close()

datasource = dict((file_name, file_contents(file_name)) for file_name in global_file)
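For what it's worth, I've also seen files read with a with statement, which closes them automatically; I assume this loader would be equivalent:

import glob

# same dictionary as above, built with a with block so each file
# is closed automatically, even if read() raises
datasource = {}
for file_name in glob.glob("mydir/*"):
    with open(file_name) as f:
        datasource[file_name] = f.read()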
Then my map function is:
# each line in each text file is structured as follows: paper-id:::author1::author2::….::authorN:::title
def mapfn(k, v):
    for w in v.splitlines():
        separator = w.split('\:\:|\:\:\:')
        for x in separator[1:len(separator)-1]:
            for y in separator[-1].split():
                yield x + y, 1
Here, k and v represent a key-value pair, where k is the file's ID and v is that file's contents. (Ultimately, I want to get the number of occurrences of each word, grouped by author.)
The problem is that when I run the algorithm, I get an empty array as the result. Is my Python syntax correct?
Answer 0 (score: 1)
I rewrote the mapfn function with better naming and the correct regexp for the split, and added a simple test:
import re

datasource = {
    "foo": (
        "paper-1:::author1::author2::authorN:::title1\n"
        "paper-2:::author21::author22::author23::author2N:::title2\n"
        "paper-3:::author31::author32:::title3"
    )
}

def mapfn(k, v):
    for line in v.splitlines():
        data = re.split(r":{2,3}", line)
        words = data[-1].split()
        for author in data[1:-1]:
            for word in words:
                yield author + word, 1

def main():
    for k, v in datasource.items():
        for result in mapfn(k, v):
            print result

if __name__ == "__main__":
    main()
This produces the following output:
bruno@betty ~/Work/playground $ python mapf.py
('author1title1', 1)
('author2title1', 1)
('authorNtitle1', 1)
('author21title2', 1)
('author22title2', 1)
('author23title2', 1)
('author2Ntitle2', 1)
('author31title3', 1)
('author32title3', 1)
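The key change is the split. str.split takes a literal substring, not a regular expression, so your separator string '\:\:|\:\:\:' never matches anything; the whole line comes back as a one-element list, separator[1:len(separator)-1] is empty, and mapfn yields nothing, which explains your empty result. re.split with the pattern r":{2,3}" splits on both the "::" and ":::" delimiters:

>>> import re
>>> line = "paper-1:::author1::author2:::title1"
>>> line.split('\:\:|\:\:\:')   # literal substring: no match, nothing is split
['paper-1:::author1::author2:::title1']
>>> re.split(r":{2,3}", line)   # regex: splits on both :: and :::
['paper-1', 'author1', 'author2', 'title1']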
Not sure whether this is what you expected, but at least it produces some output. I don't have any practical experience with MapReduce so far, so you'll either have to tell us more about the context and how you run the code, and/or wait for a resident MapReduce guru.
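Since your end goal is per-author word counts, here is a minimal sketch of a reduce step over the (key, 1) pairs that mapfn emits, using a plain in-memory dict (the name reducefn is just illustrative, and I'm assuming your MapReduce framework doesn't already do the grouping and summing for you; most do):

from collections import defaultdict

# illustrative reducer, not part of any particular framework's API
def reducefn(pairs):
    # sum the 1s emitted by mapfn, grouped by key
    counts = defaultdict(int)
    for key, n in pairs:
        counts[key] += n
    return dict(counts)

# usage: reducefn(mapfn(k, v)) for each (k, v) in datasource.items()

Note also that author + word makes the key ambiguous ('author1' + 'title1' is indistinguishable from 'author1t' + 'itle1'); yielding the tuple (author, word) instead would avoid that and still works as a dict key.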