I have a list of 3-gram tuples, built as follows:
from nltk import ngrams

test_data = ["this is all test data", "this not"]
three_gram_list = []
for data in test_data:
    three_grams = ngrams(data.split(" "), 3)
    for gram in three_grams:
        three_gram_list.append(gram)
What I want to do is create a function that checks, for each 3-gram, whether certain words are used in the same tuple. So I did the following:
def create_specific_trigram(three_grams, parameters1, parameters2):
    condition1 = False
    condition2 = False
    for three in three_grams:
        for num in range(1, 3):
            if three[num] in parameters1:
                condition1 = True
        for num in range(1, 3):
            if three[num] in parameters2:
                condition2 = True
        if condition1 and condition2:
            print(three)
But when I run it with some parameters:
parameters1 = ("test", "testing")
parameters2 = ("data", "datas")
for sentence in test_data:
    create_specific_trigram(three_grams, parameters1, parameters2)
I get the following output:
('all', 'test', 'data')
('all', 'test', 'data')
But I am only looking for one output per sentence. So in this case:
('all', 'test', 'data')
Any ideas on what changes I should apply?
Answer 0: (score: 1)
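When you call the function create_specific_trigram, you pass it the same value of three_grams every time, which has nothing to do with sentence. Build the trigrams from each sentence inside the loop instead:
from nltk import ngrams

test_data = ["this is all test data", "this not"]
parameters1 = ("test", "testing")
parameters2 = ("data", "datas")
#============================================
#implementation of create_specific_trigram
# ...
#============================================
for sentence in test_data:
    # Recompute the trigrams for the current sentence on every iteration
    three_grams = ngrams(sentence.split(" "), 3)
    create_specific_trigram(three_grams, parameters1, parameters2)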
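For reference, here is a minimal self-contained sketch of the per-sentence idea. It is not the asker's exact function: `find_specific_trigrams` and the `trigrams` helper are hypothetical names introduced here, and `trigrams` is only a dependency-free stand-in for `ngrams(tokens, 3)`. The sketch also assumes you want both conditions evaluated afresh for every 3-gram, so that a hit in an earlier tuple cannot carry over into a later one.

```python
def trigrams(tokens):
    """Return consecutive 3-token tuples (hypothetical stand-in for ngrams(tokens, 3))."""
    return zip(tokens, tokens[1:], tokens[2:])


def find_specific_trigrams(sentence, parameters1, parameters2):
    """Return the 3-grams of `sentence` whose 2nd or 3rd word hits both parameter sets."""
    matches = []
    for three in trigrams(sentence.split(" ")):
        tail = three[1:]  # same positions as range(1, 3) in the question
        # Both conditions are computed from scratch for each 3-gram.
        condition1 = any(word in parameters1 for word in tail)
        condition2 = any(word in parameters2 for word in tail)
        if condition1 and condition2:
            matches.append(three)
    return matches


test_data = ["this is all test data", "this not"]
parameters1 = ("test", "testing")
parameters2 = ("data", "datas")

for sentence in test_data:
    for three in find_specific_trigrams(sentence, parameters1, parameters2):
        print(three)
```

With the test data from the question this prints ('all', 'test', 'data') once, and nothing for "this not", which is too short to form a 3-gram. Returning a list instead of printing inside the function also makes it easier to keep only the first match per sentence if that is what you need.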