Question

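I am trying to find a way to use Python to extract all bigrams from a piece of text, where the two words are not necessarily consecutive but may be separated by N words. I have found many answers on how to find consecutive bigrams, but no clear answer for non-consecutive ones.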

I am using sklearn's CountVectorizer, as in the code below, to find my bigrams, but I don't know whether it can perform this task.

from sklearn.feature_extraction.text import CountVectorizer

# Find bigrams and their frequencies
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X = bigram_vectorizer.fit_transform(["i love coding with python"])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(bigram_vectorizer.get_feature_names_out())
bigram_frequency = X.toarray()

which prints:

["i love", "love coding", "coding with", "with python"]

For a piece of text sentence = "i love coding with python", the expected result is:

[('i', 'love'), ('i', 'coding'), ('i', 'with'), ... ,('coding', 'with'), 
('coding', 'python'), ('with', 'python')]

Answer 1

Does this have to be solved with sklearn? To find these bigrams, you can use the following function:

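def nonConsBigram(text):
    # split() with no argument also handles repeated spaces
    x = text.split()
    ret = []
    # Pair the first remaining word with every later word, then drop it
    while len(x) > 1:
        current = x[0]
        for i in x[1:]:
            ret.append((current, i))
        x = x[1:]
    return ret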

Result:

[('i', 'love'),
 ('i', 'coding'),
 ('i', 'with'),
 ('i', 'python'),
 ('love', 'coding'),
 ('love', 'with'),
 ('love', 'python'),
 ('coding', 'with'),
 ('coding', 'python'),
 ('with', 'python')]
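
The function above returns every ordered pair in the sentence. If the pairs should instead be limited to words that are at most N words apart, as the question describes, something like the following might work. This is only a sketch: the names windowed_bigrams and max_gap are my own (not from sklearn or any other library), and I am assuming "separated by N words" means at most N words in between.

```python
from collections import Counter


def windowed_bigrams(text, max_gap=None):
    """Return ordered word pairs with at most max_gap words between them.

    max_gap=None pairs every word with all later words;
    max_gap=0 yields only adjacent (ordinary) bigrams.
    """
    words = text.split()
    pairs = []
    for i, first in enumerate(words):
        # The last allowed partner sits at index i + max_gap + 1;
        # None means "up to the end of the text".
        stop = len(words) if max_gap is None else i + max_gap + 2
        for second in words[i + 1:stop]:
            pairs.append((first, second))
    return pairs


sentence = "i love coding with python"
print(windowed_bigrams(sentence, max_gap=1))
# Prints (wrapped here for readability):
# [('i', 'love'), ('i', 'coding'), ('love', 'coding'), ('love', 'with'),
#  ('coding', 'with'), ('coding', 'python'), ('with', 'python')]

# Frequencies of each pair, similar in spirit to the counts from CountVectorizer
bigram_frequency = Counter(windowed_bigrams(sentence, max_gap=1))
```

With max_gap=None this gives the same ten pairs as nonConsBigram, which is also what list(itertools.combinations(sentence.split(), 2)) returns.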
