I have the following code that counts the phrases I need from a string:
from nltk.util import ngrams
from nltk import word_tokenize
import pandas as pd
def count_words(convo, df_search):
    for i in range(len(df_search)):
        word = df_search['word'][i]  # the phrase to look for
        a = tuple(word.split(' '))
        print(word, len([g for g in ngrams(word_tokenize(convo), n=len(a)) if g == a]))
convo="I see a tall tree outside. A man is under the tall tree. Actually, there are more than one man under the tall tree"
df_search=pd.DataFrame({'word':['man','tall tree','is under the']})
count_words(convo,df_search)
The problem with this code is that it is really slow, because "ngrams" has to scan the whole string again for each new phrase. The phrases are dynamic, so I don't know in advance how long they will be. I need help changing the code to make it faster.
Answer 0 (score: 3)
If you don't mind using re:
import re
input_string = "I see a tall tree outside. A man is under the tall tree. Actually, there are more than one man under the tall tree"
word = ['man','tall tree','is under the']
for w in word:
    print(w + ': ' + str(sum(1 for _ in re.finditer(r'\b%s\b' % re.escape(w), input_string))))
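When there are many phrases, another way to avoid rescanning the string per phrase is to tokenize once and count all the needed n-gram lengths in a single pass. A minimal sketch using collections.Counter (the `phrases` list and the `\w+` tokenization are stand-ins for the question's setup, not part of the original answer):

```python
import re
from collections import Counter

input_string = ("I see a tall tree outside. A man is under the tall tree. "
                "Actually, there are more than one man under the tall tree")
phrases = ['man', 'tall tree', 'is under the']

# Tokenize once; \w+ is a rough stand-in for a real tokenizer.
tokens = re.findall(r'\w+', input_string.lower())

# One pass per distinct phrase length, instead of one pass per phrase.
counts = Counter()
for n in {len(p.split()) for p in phrases}:
    counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

for p in phrases:
    print(p, counts[tuple(p.lower().split())])
```

After the single counting pass, each lookup is a dictionary access, so adding more phrases of the same lengths costs almost nothing.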
Answer 1 (score: 1)
Given a recent enough version of NLTK, there is an everygrams implementation: https://github.com/nltk/nltk/blob/develop/nltk/util.py#L464
After that you can simply count:
>>> from nltk import word_tokenize
>>> from nltk.util import everygrams
>>> sent = word_tokenize("I see a tall tree outside. A man is under the tall tree. Actually, there are more than one man under the tall tree")
>>> ng = tuple(['tall', 'tree'])
>>> list(everygrams(sent)).count(ng)
3
If not, you can always create your own everygrams function (just cut and paste from https://github.com/nltk/nltk/blob/develop/nltk/util.py#L464), and then do the count =)
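If cutting and pasting from NLTK isn't an option either, a dependency-free sketch of the same idea might look like this (my re-implementation of everygrams' documented behaviour, with `re.findall(r'\w+', ...)` as a rough stand-in for `word_tokenize`):

```python
import re
from collections import Counter

def everygrams(sequence, min_len=1, max_len=-1):
    # Yield every n-gram of sequence with length min_len..max_len,
    # mirroring the behaviour of nltk.util.everygrams.
    if max_len == -1:
        max_len = len(sequence)
    for n in range(min_len, max_len + 1):
        for i in range(len(sequence) - n + 1):
            yield tuple(sequence[i:i + n])

text = ("I see a tall tree outside. A man is under the tall tree. "
        "Actually, there are more than one man under the tall tree")
sent = re.findall(r'\w+', text)  # rough stand-in for word_tokenize

# Counting once with a Counter avoids re-listing the n-grams per phrase;
# capping max_len at the longest phrase avoids generating n-grams we never look up.
counts = Counter(everygrams(sent, max_len=3))
print(counts[('tall', 'tree')])  # → 3
```

Capping `max_len` matters: by default everygrams generates n-grams of every possible length, which is quadratic in the sentence length.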
Answer 2 (score: 1)
Could you replace the print statement with this?

print(word, convo.count(word))
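One caveat worth noting (my addition, not part of the original answer): str.count matches raw substrings, so a phrase is also counted inside longer words; re with word boundaries avoids that:

```python
import re

text = "A man and a woman gave many humans a map"
# Substring matching also hits "woman", "many", and "humans":
print(text.count("man"))                  # → 4
# Word-boundary matching counts only the standalone word:
print(len(re.findall(r'\bman\b', text)))  # → 1
```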