So, I'm running Python 3.3.2, and I have a string (a sentence, a paragraph):
mystring=["walk walked walking talk talking talks talked fly flying"]
I also have another list containing the words I need to search for in that string:
list_of_words=["walk","talk","fly"]
My question is: is there a way to get a count of each of these words in the string? Most importantly, is it possible to count all possible variations of a word?
Answer 0 (score: 2)
One approach could be to split the string on spaces, then look for all words that contain the particular word whose variations you want to find.
For example:
def num_variations(word, sentence):
    return sum(1 for snippet in sentence.split(' ') if word in snippet)

for word in ["walk", "talk", "fly"]:
    print(word, num_variations(word, "walk walked walking talk talking talks talked fly flying"))
However, this approach is somewhat naive and has no understanding of English morphology. For example, with this method, "fly" will not match "flies".
In that case, you may need to use some kind of natural-language library equipped with a decent dictionary to catch these edge cases.
You may find this answer useful. It accomplishes something similar by using the NLTK library to find the stems of words (removing plurals, irregular spellings, etc.) and then summing them with a method much like the one above. However, it may be overkill for your case, depending on what you're trying to accomplish.
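As a lightweight middle ground before reaching for a full NLP library, you could patch known irregular forms by hand. A minimal sketch (the `IRREGULAR` table and `normalize` helper are hypothetical, purely for illustration): map irregular tokens such as "flies" back to their base word before the substring check.

```python
# Hypothetical lookup of irregular forms; extend as needed for your text.
IRREGULAR = {"flies": "fly", "flew": "fly", "flown": "fly"}

def normalize(token):
    # Replace a known irregular form with its base word, else keep it as-is.
    return IRREGULAR.get(token, token)

def num_variations(word, sentence):
    # Same naive substring test as above, but on normalized tokens.
    return sum(1 for tok in sentence.split(' ') if word in normalize(tok))

print(num_variations("fly", "the bird flies and the plane flew"))  # 2
```

This only covers the irregular forms you enumerate, which is exactly the limitation a real dictionary-backed library removes.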
Answer 1 (score: 0)
from difflib import get_close_matches

mystring = "walk walked walking talk talking talks talked fly flying"
list_of_words = ["walk", "talk", "fly"]

sp = mystring.split()
for x in list_of_words:
    li = [y for y in get_close_matches(x, sp, cutoff=0.5) if x in y]
    print('%-7s %d in %-10s' % (x, len(li), li))
Result:
walk 2 in ['walk', 'walked']
talk 3 in ['talk', 'talks', 'talked']
fly 2 in ['fly', 'flying']
The cutoff refers to the same ratio computed by SequenceMatcher:
from difflib import SequenceMatcher

sq = SequenceMatcher(None)
for x in list_of_words:
    for w in sp:
        sq.set_seqs(x, w)
        print('%-7s %-10s %f' % (x, w, sq.ratio()))
Result:
walk walk 1.000000
walk walked 0.800000
walk walking 0.727273
walk talk 0.750000
walk talking 0.545455
walk talks 0.666667
walk talked 0.600000
walk fly 0.285714
walk flying 0.200000
talk walk 0.750000
talk walked 0.600000
talk walking 0.545455
talk talk 1.000000
talk talking 0.727273
talk talks 0.888889
talk talked 0.800000
talk fly 0.285714
talk flying 0.200000
fly walk 0.285714
fly walked 0.222222
fly walking 0.200000
fly talk 0.285714
fly talking 0.200000
fly talks 0.250000
fly talked 0.222222
fly fly 1.000000
fly flying 0.666667
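For reference, `SequenceMatcher.ratio()` is defined as 2.0*M / T, where M is the number of matching characters and T is the total length of both sequences. The 0.8 for "walk" vs "walked" in the table above falls out directly:

```python
from difflib import SequenceMatcher

# "walk" vs "walked": M = 4 matching characters, T = 4 + 6 = 10 total,
# so ratio() = 2 * 4 / 10 = 0.8, matching the table above.
sm = SequenceMatcher(None, "walk", "walked")
print(sm.ratio())  # 0.8
```

This is why a cutoff of 0.5 keeps "walked" and "walking" as matches for "walk" but discards "flying".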
Answer 2 (score: 0)
I know this is an old question, but I feel this discussion would not be complete without mentioning the NLTK library, which provides a wealth of natural-language processing tools, including ones that make this task very easy.
Basically, you want to compare the uninflected words in your target list against the uninflected forms of the words in mystring. There are two common ways to remove inflections (e.g. -ing, -ed, -s): stemming and lemmatization. In English, lemmatization, which reduces words to their dictionary forms, generally works better, but for this task I think stemming is the right choice. In any case, stemming is usually faster.
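To show the idea in isolation, here is a toy suffix-stripping stemmer (a drastic simplification of what Snowball actually does; `toy_stem` is a hypothetical helper, not part of NLTK): inflected forms of the same word are reduced to a common stem before comparison.

```python
def toy_stem(word):
    # Strip one common inflectional suffix, leaving at least 3 characters.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print(toy_stem("walking"), toy_stem("walked"), toy_stem("talks"))  # walk walk talk
```

A real stemmer handles far more suffix patterns and spelling changes; the NLTK code below uses the Snowball (EnglishStemmer) algorithm for exactly that reason.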
mystring = "walk walked walking talk talking talks talked fly flying"
list_of_words = ["walk", "talk", "fly"]

word_counts = {}

from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()

for target in list_of_words:
    word_counts[target] = 0
    for word in mystring.split(' '):
        # Stem the word and compare it to the stem of the target
        stem = stemmer.stem(word)
        if stem == stemmer.stem(target):
            word_counts[target] += 1

print(word_counts)
Output:
{'fly': 2, 'talk': 4, 'walk': 3}