Question

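I have two documents, and I need to count the words in both of them while keeping track of which document each word came from. doc1.txt = "I have an apple", doc2.txt = "I live in an apartment". I want to do this with MapReduce, and the output should look like ((word, document name), count). Example: ((apple, doc1.txt),1)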

#!/usr/bin/env python

import sys
import glob
#from string import punctuation

#--- get all lines from stdin ---
for line in sys.stdin:
    #--- remove leading and trailing whitespace ---
    #line=line.translate(None, punctuation).strip('\t')
    line = line.strip()

    #--- split the line into words ---
    words = line.split()
    doc_name = glob.glob("*.txt")

    for doc in doc_name:
        print(doc)
        if doc[] == '':

        for word in words:

            #word = word.rstrip()
            key = word + ',' + doc
            #print '%s\t%s' % (key, "1")

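This code prints every word from every document, but it pairs each word with both document names instead of only the document the word came from, like this: (apple, doc1.txt),1 (apple, doc2.txt),1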
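The duplication comes from glob.glob("*.txt"), which returns every .txt file in the working directory, so the inner loop pairs each word with all of the file names. A minimal sketch of one way around this is to tag each word with the single document it was read from and then sum the counts per (word, document) key. The names map_words, reduce_counts and current_doc_name are hypothetical helpers, not part of any library. The sketch also assumes that words should be lowercased and that, under Hadoop Streaming, the input file is exposed through the mapreduce_map_input_file environment variable (map_input_file on older releases). The last few lines simulate the job locally with the two example documents.

```python
import os
from collections import Counter


def current_doc_name(default="unknown"):
    """Name of the file the current map task is reading.

    Assumption: the job runs under Hadoop Streaming, which exports job
    settings as environment variables (mapreduce_map_input_file, or
    map_input_file on older releases).
    """
    path = (os.environ.get("mapreduce_map_input_file")
            or os.environ.get("map_input_file"))
    return os.path.basename(path) if path else default


def map_words(doc_name, lines):
    """Mapper: emit ((word, doc_name), 1) for every word of one document."""
    for line in lines:
        for word in line.strip().split():
            # Lowercasing is an assumption; drop it to keep case distinctions.
            yield (word.lower(), doc_name), 1


def reduce_counts(pairs):
    """Reducer: sum the counts for each (word, doc_name) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Local simulation with the two example documents from the question.
docs = {
    "doc1.txt": ["I have an apple"],
    "doc2.txt": ["I live in an apartment"],
}
pairs = [pair
         for name, lines in docs.items()
         for pair in map_words(name, lines)]
result = reduce_counts(pairs)

for (word, doc), count in sorted(result.items()):
    print("((%s, %s), %d)" % (word, doc, count))
```

In a real streaming job the mapper would presumably call map_words(current_doc_name(), sys.stdin) and print each key and count separated by a tab, so that the framework groups identical (word, document) keys before the reducer sums them.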

How to count words, with document name and count, in Python for MapReduce from a set of documents

0 answers: