Using a "bag of common phrases" to find unusual phrases

Date: 2018-02-22 10:23:34

Tags: python python-3.x pandas scikit-learn text-mining

My goal is to take an array of phrases as input, such as

array = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.","At vero eos et accusam et justo duo dolores et ea rebum.","Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]

and then feed it a new phrase, e.g.

"Felix qui potuit rerum cognoscere causas"

and have it tell me whether this phrase is likely to belong to the group of phrases in the aforementioned array.

I have found out how to detect word frequencies, but how do I find dissimilarity? After all, my goal is to find unusual phrases, not the frequency of particular words.

2 answers:

Answer 0 (score: 2):

You can build a simple "language model" for this purpose. It will estimate the probability of a phrase and flag phrases with low average per-word probability as unusual.

For word probability estimation, it can use smoothed word counts.
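
Concretely, the code below uses additive (Lidstone) smoothing: writing N for the total number of word occurrences in the training corpus and V for the vocabulary size, the estimated probability of a word is

P(word) = (count(word) + delta) / (N + V * delta)

so an unseen word still gets a small non-zero probability instead of breaking the logarithm.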

This is what the model looks like:

import re
import numpy as np
from collections import Counter

class LanguageModel:
    """ A simple model to measure 'unusualness' of sentences. 
    delta is a smoothing parameter. 
    The larger delta is, the higher the penalty for unseen words.
    """
    def __init__(self, delta=0.01):
        self.delta = delta
    def preprocess(self, sentence):
        words = sentence.lower().split()
        return [re.sub(r"[^A-Za-z]+", '', word) for word in words]
    def fit(self, corpus):
        """ Estimate counts from an array of texts """
        self.counter_ = Counter(word 
                                 for sentence in corpus 
                                 for word in self.preprocess(sentence))
        self.total_count_ = sum(self.counter_.values())
        self.vocabulary_size_ = len(self.counter_)
    def perplexity(self, sentence):
        """ Calculate negative mean log probability of a word in a sentence 
        The higher this number, the more unusual the sentence is.
        """
        words = self.preprocess(sentence)
        mean_log_proba = 0.0
        for word in words:
            # use a smoothed version of "probability" to work with unseen words
            word_count = self.counter_.get(word, 0) + self.delta
            total_count = self.total_count_ + self.vocabulary_size_ * self.delta
            word_probability = word_count / total_count
            mean_log_proba += np.log(word_probability) / len(words)
        return -mean_log_proba

    def relative_perplexity(self, sentence):
        """ Perplexity, normalized between 0 (the most usual sentence) and 1 (the most unusual)"""
        return (self.perplexity(sentence) - self.min_perplexity) / (self.max_perplexity - self.min_perplexity)

    @property
    def max_perplexity(self):
        """ Perplexity of an unseen word """
        return -np.log(self.delta / (self.total_count_ + self.vocabulary_size_ * self.delta))

    @property
    def min_perplexity(self):
        """ Perplexity of the most likely word """
        return self.perplexity(self.counter_.most_common(1)[0][0])

You can train this model and apply it to different sentences.

train = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
                 "At vero eos et accusam et justo duo dolores et ea rebum.",
                 "Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]
test = ["Felix qui potuit rerum cognoscere causas", # an "unlikely" phrase
        'sed diam nonumy eirmod sanctus sit amet', # a "likely" phrase
       ]

lm = LanguageModel()
lm.fit(train)

for sent in test:
    print(lm.perplexity(sent).round(3), sent)

This prints:

8.525 Felix qui potuit rerum cognoscere causas
3.517 sed diam nonumy eirmod sanctus sit amet

You can see that the "unusualness" of the first phrase is higher than that of the second one, because the second phrase is composed of the training words.
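
The class also exposes relative_perplexity, which rescales the score to the [0, 1] range; one way to turn it into a yes/no decision is to compare against a cutoff (the 0.5 below is an arbitrary illustration, not a value derived from the data):

for sent in test:
    score = lm.relative_perplexity(sent)
    # a score near 1 means the phrase consists mostly of unseen words,
    # a score near 0 means it is built from the most frequent training words
    print(round(score, 3), "unusual" if score > 0.5 else "usual", sent)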

If your corpus of "common" phrases is large enough, you could switch from the 1-gram model I have used to N-grams (for English, a sensible N would be 2 or 3). Alternatively, you could use a recurrent neural network to predict the probability of each word conditioned on all the preceding words. But this would require a really huge training corpus.
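
As a rough sketch of the bigram direction (fit_bigrams is a hypothetical helper, not part of the model above), only the counting step changes, from single words to adjacent word pairs:

from collections import Counter

def fit_bigrams(corpus, preprocess):
    """ Count adjacent word pairs instead of single words. """
    counter = Counter()
    for sentence in corpus:
        words = preprocess(sentence)
        # zip pairs each word with its right-hand neighbour
        counter.update(zip(words, words[1:]))
    return counter

bigram_counts = fit_bigrams(train, lm.preprocess)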

If you work with a highly inflected language like Turkish, you could use character-level N-grams instead of a word-level model, or you could preprocess your texts with a lemmatization algorithm from NLTK.
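
For the character-level variant, only preprocess needs to change so that the "words" become overlapping character windows; a minimal sketch as a subclass (the window size n=3 is an arbitrary choice):

class CharLanguageModel(LanguageModel):
    def __init__(self, n=3, delta=0.01):
        super().__init__(delta=delta)
        self.n = n
    def preprocess(self, sentence):
        text = sentence.lower()
        # slide a window of n characters over the whole sentence
        return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]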

Answer 1 (score: 0):

To find common phrases in sentences, you can use Gensim's Phrase (collocation) detection.
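
A minimal sketch of what that looks like with gensim (the min_count and threshold values here are arbitrary; with a real corpus you would keep them higher):

from gensim.models.phrases import Phrases

tokenized = [sentence.lower().split() for sentence in array]
# Phrases learns which adjacent word pairs co-occur often enough
# to be merged into a single token such as "lorem_ipsum"
phrases = Phrases(tokenized, min_count=1, threshold=0.1)
print(phrases[tokenized[0]])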

But if you want to detect unusual phrases, you could describe certain part-of-speech combination patterns with RegEx, do POS tagging on the input sentence, and then extract the unseen words (phrases) that match your patterns.
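
A sketch of that idea with NLTK (the adjective-followed-by-noun pattern is just one illustrative choice, and nltk.pos_tag requires the tagger data to be downloaded):

import re
import nltk  # needs the 'punkt' and 'averaged_perceptron_tagger' data

sentence = "Felix qui potuit rerum cognoscere causas"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
# join the tags into one string so a RegEx can describe tag patterns
tag_string = ' '.join(tag for _, tag in tagged)
# match an adjective (JJ, JJR, JJS) directly followed by a noun (NN, NNS, ...)
match = re.search(r"JJ\w* NN\w*", tag_string)
if match:
    print("found an adjective-noun pattern:", match.group())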