Python: Count a list of words, unless certain words come before them

Time: 2015-03-08 04:54:31

Tags: python nltk

I'm not sure whether a related question already exists. If one does, please let me know. I searched but could not find any.

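I want to count the words in a list, but skip a word when certain other words appear within the three (or fewer) words before it. Here is an example adapted from "Count occurrences of a couple of specific words":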

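I want to count "foo", "bar", and "baz", except when "no" appears within the three or fewer words before them. In this case, one "bar" and one "foo" should not be counted.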

import re

vocab = ["foo", "bar", "baz"]
exception = ["no"]
s = "foo bar baz no bar quux foo bla bla"

# Counts every vocab word; the exception rule is not applied yet.
wordcount = dict((x, 0) for x in vocab)
for w in re.findall(r"\w+", s):
    if w in wordcount:
        wordcount[w] += 1

Please help me. Thank you very much in advance.

3 Answers:

Answer 0: (score: 2)

How about:

vocab = ["foo", "bar", "baz"]
exception = ["no"]
s = "foo bar baz no bar quux foo bla bla"

wordcount = dict((x, 0) for x in vocab)

words = s.split()

i = 0
while i < len(words):
    cur_word = words[i]
    if cur_word in exception:
        # Skip the exception word itself plus the three words after it.
        i += 4
    else:
        if cur_word in vocab:
            wordcount[cur_word] += 1
        i += 1

print(wordcount)  # {'foo': 1, 'bar': 1, 'baz': 1}

It simply exploits the fact that whenever we encounter "no", we can skip the following 3 elements.
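One caveat with jumping ahead by four positions: a second exception word inside the skipped span is never examined, so its own window is cut short. In "no a b no foo", for instance, the final "foo" would still be counted. Below is a minimal sketch of a variant that restarts the window on every exception word, assuming plain whitespace tokenization. The function name `count_unless_preceded` and the `window` parameter are illustrative and not part of the answer.

```python
def count_unless_preceded(text, vocab, exceptions, window=3):
    """Count vocab words, ignoring those within `window` words after an exception word."""
    counts = {word: 0 for word in vocab}
    blocked = 0  # number of upcoming words still inside a suppression window
    for word in text.split():
        if word in exceptions:
            blocked = window      # start (or restart) the window
        elif blocked > 0:
            blocked -= 1          # inside a window: consume the word without counting
        elif word in counts:
            counts[word] += 1
    return counts


s = "foo bar baz no bar quux foo bla bla"
print(count_unless_preceded(s, ["foo", "bar", "baz"], ["no"]))
# {'foo': 1, 'bar': 1, 'baz': 1}
```

Using a countdown instead of an index jump keeps the loop a single pass and makes the window size a parameter.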

Answer 1: (score: 1)

Just replace "no" and the three words that follow it with an empty string, then count the words in the resulting string.

>>> import re
>>> s = 'foo bar baz no bar quux foo bla bla'
>>> vocab = ["foo", "bar", "baz"]
>>> exception = ["no"]
>>> wordcount = dict((x, 0) for x in vocab)
>>> m = re.sub(r'(?:^|\s)no(\s+\S+){0,3}', '', s)
>>> for w in re.findall(r"\w+", m):
...     if w in wordcount:
...         wordcount[w] += 1
...
>>> wordcount
{'foo': 1, 'bar': 1, 'baz': 1}
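The pattern above hard-codes "no" and would also match it as the prefix of a longer token such as "nofoo". The sketch below builds the pattern from the exception list and requires whitespace or a string boundary on both sides of the exception word. The function name `count_with_regex` and the `window` parameter are assumptions made for illustration. Like the original pattern, it removes a fixed span, so an exception word that falls inside a removed span does not start a new window.

```python
import re


def count_with_regex(text, vocab, exceptions, window=3):
    """Delete each exception word plus up to `window` following tokens, then count."""
    counts = {word: 0 for word in vocab}
    cleaned = text
    if exceptions:
        # re.escape protects against regex metacharacters in the exception words.
        alternation = "|".join(re.escape(word) for word in exceptions)
        # (?<!\S) and (?!\S) demand whitespace or a string edge around the word.
        pattern = re.compile(
            r"(?<!\S)(?:%s)(?!\S)(?:\s+\S+){0,%d}" % (alternation, window)
        )
        cleaned = pattern.sub(" ", text)
    for token in cleaned.split():
        if token in counts:
            counts[token] += 1
    return counts


s = "foo bar baz no bar quux foo bla bla"
print(count_with_regex(s, ["foo", "bar", "baz"], ["no"]))
# {'foo': 1, 'bar': 1, 'baz': 1}
```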

Answer 2: (score: 1)

You can actually do this with plain Python string methods, no regular expressions needed:

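vocab = ["foo", "bar", "baz"]
ex_list = ["no"]
s = "foo bar baz no bar quux foo bla bla"

words = s.split()
wordcount = dict((x, 0) for x in vocab)
for i, word in enumerate(words):
    # Skip the word if an exception word is among the (up to) three words before it.
    # max(0, i - 3) keeps the slice start from going negative near the start of the list.
    if any(w in ex_list for w in words[max(0, i - 3):i]):
        continue
    elif word in vocab:
        wordcount[word] += 1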

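The two branches can be merged into a single condition. Slicing never raises an IndexError, but a negative start counts from the end of the list, so the start is clamped with max(0, i - 3):

for i, word in enumerate(words):
    if word in vocab and not any(w in ex_list for w in words[max(0, i - 3):i]):
        wordcount[word] += 1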
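A note on the slice start: for i < 3 the expression i - 3 is negative, and Python interprets a negative slice index as an offset from the end of the list. On a list longer than three words that usually produces an empty slice, so an exception word near the start of the text would go unnoticed. The short sketch below shows the difference; the helper name `preceding_window` is hypothetical.

```python
def preceding_window(words, i, size=3):
    """Return up to `size` words immediately before position i."""
    return words[max(0, i - size):i]


words = "no foo bar baz qux".split()

naive = words[1 - 3:1]              # start is -2, which means index 3 here
clamped = preceding_window(words, 1)

print(naive)    # []
print(clamped)  # ['no']
```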