How can I cluster sentences based on n-gram overlap in Python?

Asked: 2019-05-21 14:38:31

Tags: python nlp n-gram

I need to cluster sentences according to the common n-grams they contain. I can easily extract n-grams with nltk, but I don't know how to perform clustering based on n-gram overlap. That's why I couldn't write much actual code, and I apologize for that in advance. I wrote six simple sentences and the expected output to illustrate the problem.

import nltk
from nltk.util import ngrams

sentences = """I would like to eat pizza with her.
She would like to eat pizza with olive.
There are some sentences must be clustered.
These sentences must be clustered according to common trigrams.
The quick brown fox jumps over the lazy dog.
Apples are red, bananas are yellow."""

# Load the pre-trained Punkt sentence tokenizer for English.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokens = sent_detector.tokenize(sentences.strip())

# Collect the word trigrams of each sentence.
mytrigrams = []
for sentence in sentence_tokens:
    trigrams = ngrams(sentence.lower().split(), 3)
    mytrigrams.append(list(trigrams))
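
For reference, after this loop each entry of mytrigrams is a list of 3-word tuples; for example, the first sentence gives:

print(mytrigrams[0][:3])
# [('i', 'would', 'like'), ('would', 'like', 'to'), ('like', 'to', 'eat')]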

After this I don't know how to cluster the sentences based on their common trigrams (I'm not even sure this part is right). I tried itertools.combinations but got lost, and I don't know how to produce multiple clusters, since the number of clusters can't be known without comparing every sentence with every other one. The expected output is below; thanks in advance for any help.

Cluster1: 'I would like to eat pizza with her.'
          'She would like to eat pizza with olive.'

Cluster2: 'There are some sentences must be clustered.' 
          'These sentences must be clustered according to common trigrams.'

Sentences do not belong to any cluster:                                
          'The quick brown fox jumps over the lazy dog.'
          'Apples are red, bananas are yellow.'
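
For illustration, one way the overlap test itself might look (a minimal sketch reusing sentence_tokens and mytrigrams from above; trigram_sets is just an illustrative name):

# Turn each sentence's trigram list into a set so overlap is easy to test.
trigram_sets = {s: set(t) for s, t in zip(sentence_tokens, mytrigrams)}

s1, s2 = sentence_tokens[0], sentence_tokens[1]
print(trigram_sets[s1] & trigram_sets[s2])
# shared trigrams, e.g.: ('would', 'like', 'to'), ('like', 'to', 'eat'),
# ('to', 'eat', 'pizza'), ('eat', 'pizza', 'with')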

EDIT: I tried combinations again, but it didn't cluster anything at all; it just returned every pair of sentences. (Clearly I'm doing something wrong.)

from itertools import combinations

# Map each sentence to its list of trigrams.
new_dict = {k: v for k, v in zip(sentence_tokens, mytrigrams)}

common = []
no_cluster = []
sentence_pairs = combinations(new_dict.keys(), 2)

for sentence1, sentence2 in sentence_pairs:
    # Note: this intersects the *characters* of the two sentence strings,
    # not their trigrams, so almost every pair ends up in common.
    if len(set(sentence1) & set(sentence2)) != 0:
        common.append((sentence1, sentence2))
    else:
        no_cluster.append((sentence1, sentence2))

print(common)

But even if this code worked, it wouldn't give the output I expect, because I don't know how to generate multiple clusters from the common n-grams.
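
For reference, one possible way to get from pairwise overlap to whole clusters (a minimal sketch, not necessarily the intended solution): treat sentences as nodes, link any two sentences whose trigram sets intersect, and read the clusters off as connected components, so the number of clusters falls out automatically. It reuses sentence_tokens and mytrigrams from above; find/union are a small hand-rolled union-find.

from itertools import combinations

# One trigram set per sentence.
trigram_sets = {s: set(t) for s, t in zip(sentence_tokens, mytrigrams)}

# Minimal union-find over sentences.
parent = {s: s for s in sentence_tokens}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Link every pair of sentences that shares at least one trigram.
for s1, s2 in combinations(sentence_tokens, 2):
    if trigram_sets[s1] & trigram_sets[s2]:
        union(s1, s2)

# Group sentences by the root of their component.
clusters = {}
for s in sentence_tokens:
    clusters.setdefault(find(s), []).append(s)

for i, members in enumerate([m for m in clusters.values() if len(m) > 1], 1):
    print(f"Cluster{i}:", members)

print("Sentences do not belong to any cluster:",
      [m[0] for m in clusters.values() if len(m) == 1])

On the six example sentences this yields the two clusters and the two leftover sentences shown in the expected output above.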

1 Answer:

Answer 0 (score: 0)

To understand your problem better, could you explain the purpose and the expected result?
Be very careful with n-grams: using them increases the dimensionality of your dataset.
I suggest you start with TF-IDF and only move on to n-grams if you don't reach a minimum hit rate.
If you can explain your problem in more detail, I can help you.
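
For illustration, a minimal sketch of the TF-IDF route suggested above, using scikit-learn (the library choice and word-level unigrams are assumptions, not part of the answer):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "I would like to eat pizza with her.",
    "She would like to eat pizza with olive.",
    "There are some sentences must be clustered.",
    "These sentences must be clustered according to common trigrams.",
    "The quick brown fox jumps over the lazy dog.",
    "Apples are red, bananas are yellow.",
]

# TF-IDF vectors over word unigrams; ngram_range=(1, 3) would mix in bigrams/trigrams.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(sentences)

# Pairwise cosine similarity between the sentence vectors;
# pairs with high similarity are candidates for the same cluster.
similarity = cosine_similarity(tfidf)
print(similarity.round(2))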