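I am trying out the MapReduce pairs pattern in Python. I need to check whether a word is in a text file, then find the word next to it and yield a pair of the two words. I keep running into: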
neighbors = words[words.index(w) + 1]
ValueError: substring not found
or
ValueError: ("the") is not in list
file cwork_trials.py
from mrjob.job import MRJob
class MRCountest(MRJob):
    # Word count
    def mapper(self, _, document):
        # Assume document is a list of words.
        #words = []
        words = document.strip()
        w = "the"
        neighbors = words.index(w)
        for word in words:
            #searchword = "the"
            #wor.append(str(word))
            #neighbors = words[words.index(w) + 1]
            yield(w,1)
    def reducer(self, w, values):
        yield(w,sum(values))
if __name__ == '__main__':
    MRCountest.run()
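Edit: I am trying to use the pairs pattern to search a document for every instance of a specific word and, each time, find the word next to it. I then want to yield a pair for each instance, i.e. each occurrence of "the" together with the word next to it: [the],[book], [the],[cat], and so on.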
from mrjob.job import MRJob
class MRCountest(MRJob):
    # Word count
    def mapper(self, _, document):
        # Assume document is a list of words.
        #words = []
        words = document.split(" ")
        want = "the"
        for w, want in enumerate(words, 1):
            if (w+1) < len(words):
                neighbors = words[w + 1]
                pair = (want, neighbors)
                for u in neighbors:
                    if want is "the":
                        #pair = (want, neighbors)
                        yield(pair),1
        #neighbors = words.index(w)
        #for word in words:
            #searchword = "the"
            #wor.append(str(word))
            #neighbors = words[words.index(w) + 1]
            #yield(w,1)
    #def reducer(self, w, values):
        #yield(w,sum(values))
if __name__ == '__main__':
    MRCountest.run()
As it stands, I get a yield for every word pair, with multiple copies of the same pairing.
Answer 0 (score: 1)
When you use words.index("the"), you only get the first instance of "the" in the list or string and, as you found, you get a ValueError if "the" is not present.
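To make that behavior concrete, here is a minimal sketch in plain Python (no mrjob). The sample sentence and the helper name safe_index are made up for illustration. It shows that list.index stops at the first match, how enumerate can collect every position instead, and one way to avoid the ValueError:

```python
# Sample input invented for this illustration.
words = "the book and the cat".split()

# list.index returns only the position of the first match.
first = words.index("the")

# Collect every position of the word instead of only the first one.
positions = [i for i, word in enumerate(words) if word == "the"]


def safe_index(items, target):
    """Hypothetical helper: first index of target, or -1 if it is absent."""
    try:
        return items.index(target)
    except ValueError:
        # list.index raises ValueError when the item is missing.
        return -1
```

The same applies to strings: str.index also returns only the first match, and it raises "ValueError: substring not found" when there is none, which is the first of the two errors in the question.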
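Also, you mention that you are trying to produce pairs, but you only yield a single word.
I think what you are trying to do is something more like this: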
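def get_word_pairs(words):
    for i, word in enumerate(words):
        # Pair each word with the word that follows it, if there is one.
        if (i+1) < len(words):
            yield (word, words[i + 1]), 1
        # Pair each word with the word before it; i > 0 keeps words[0] reachable.
        if i > 0:
            yield (word, words[i - 1]), 1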
This assumes you are interested in neighbors in both directions. (If not, you only need the first yield.)
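If only pairs that start with one specific word are wanted, as in the edit to the question, the same loop can be filtered. The following is a rough sketch rather than a drop-in mrjob job: pairs_for_word and count_pairs are hypothetical names, and count_pairs only stands in for what a reducer would do with sum(values). It compares strings with == rather than is, and it yields once per occurrence instead of once per character of the neighbor.

```python
from collections import Counter


def pairs_for_word(words, target="the"):
    """Yield ((target, next_word), 1) for each occurrence of target
    that is followed by another word."""
    # words[:-1] skips the last word, which has no right-hand neighbor.
    for i, word in enumerate(words[:-1]):
        if word == target:  # compare values with ==, not identity with `is`
            yield (word, words[i + 1]), 1


def count_pairs(emitted):
    """Stand-in for the reduce step: sum the counts for each pair."""
    counts = Counter()
    for key, value in emitted:
        counts[key] += value
    return dict(counts)


# Sample input invented for this illustration.
words = "the book and the cat saw the book".split()
result = count_pairs(pairs_for_word(words))
# result == {("the", "book"): 2, ("the", "cat"): 1}
```

Inside an mrjob mapper, the body of pairs_for_word would presumably become the mapper loop, with the reducer yielding (pair, sum(values)).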
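Finally, since you are calling document.strip(), I suspect the document is actually a string rather than a list. If that is the case, you can get a list of words with words = document.split(" "), assuming you do not have any punctuation.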
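If the text does contain punctuation or repeated spaces, the split needs a little more care. Below is a small sketch of one possible approach, using a hypothetical tokenize helper. It assumes words consist only of ASCII letters and apostrophes, which may not suit every document:

```python
import re


def tokenize(document):
    """Hypothetical helper: lowercase the text and keep runs of
    letters/apostrophes, dropping punctuation and whitespace."""
    return re.findall(r"[a-z']+", document.lower())


# split(" ") keeps empty strings when there are consecutive spaces ...
with_empties = "the  book".split(" ")
# ... while split() with no argument splits on any run of whitespace.
without_empties = "the  book".split()

tokens = tokenize("The book, the cat.\n")
```

Lowercasing also means "The" at the start of a sentence matches a search for "the".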