如何使用pandas / sklearn删除停止短语/停止ngram(多字符串)?

时间:2017-07-31 22:25:30

标签: python pandas scikit-learn nlp

我想阻止某些短语进入我的模型。例如,我想防止红玫瑰'从进入我的分析。我了解如何添加Adding words to scikit-learn's CountVectorizer's stop list中给出的单个停用词:

from sklearn.feature_extraction import text
additional_stop_words=['red','roses']

然而,这也导致其他ngrams,如红色郁金香'或者是蓝玫瑰'没被发现。

我正在构建一个TfidfVectorizer作为我模型的一部分,我意识到在这个阶段之后可能需要输入我需要的处理但我不知道该怎么做。

我最终的目标是在一段文字上进行主题建模。以下是我正在处理的代码片段(几乎直接来自https://de.dariah.eu/tatom/topic_model_python.html#index-0):

from sklearn import decomposition

from sklearn.feature_extraction import text
additional_stop_words = ['red', 'roses']

sw = text.ENGLISH_STOP_WORDS.union(additional_stop_words)
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2,3),
    stop_words=sw,
    norm='l2',
    min_df=5
)

dtm = mod_vectorizer.fit_transform(df[col]).toarray()
vocab = np.array(mod_vectorizer.get_feature_names())
num_topics = 5
num_top_words = 5
m_clf = decomposition.LatentDirichletAllocation(
    n_topics=num_topics,
    random_state=1
)

doctopic = m_clf.fit_transform(dtm)
topic_words = []

for topic in m_clf.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
for t in range(len(topic_words)):
    print("Topic {}: {}".format(t, ','.join(topic_words[t][:5])))

修改

示例数据帧(我尝试插入尽可能多的边缘情况),df:

   Content
0  I like red roses as much as I like blue tulips.
1  It would be quite unusual to see red tulips, but not RED ROSES
2  It is almost impossible to find blue roses
3  I like most red flowers, but roses are my favorite.
4  Could you buy me some red roses?
5  John loves the color red. Roses are Mary's favorite flowers.

4 个答案:

答案 0 :(得分:5)

TfidfVectorizer允许自定义预处理器。您可以使用它来进行任何所需的调整。

例如,删除所有连续的" red" +"玫瑰"你的示例语料库中的标记(不区分大小写),使用:

import re
from sklearn.feature_extraction import text

cases = ["I like red roses as much as I like blue tulips.",
         "It would be quite unusual to see red tulips, but not RED ROSES",
         "It is almost impossible to find blue roses",
         "I like most red flowers, but roses are my favorite.",
         "Could you buy me some red roses?",
         "John loves the color red. Roses are Mary's favorite flowers."]

# remove_stop_phrases() is our custom preprocessing function.
def remove_stop_phrases(doc):
    # note: this regex considers "... red. Roses..." as fair game for removal.
    #       if that's not what you want, just use ["red roses"] instead.
    stop_phrases= ["red(\s?\\.?\s?)roses"]
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    return doc

sw = text.ENGLISH_STOP_WORDS
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2,3),
    stop_words=sw,
    norm='l2',
    min_df=1,
    preprocessor=remove_stop_phrases  # define our custom preprocessor
)

dtm = mod_vectorizer.fit_transform(cases).toarray()
vocab = np.array(mod_vectorizer.get_feature_names())

现在vocab删除了所有red roses个引用。

print(sorted(vocab))

['Could buy',
 'It impossible',
 'It impossible blue',
 'It quite',
 'It quite unusual',
 'John loves',
 'John loves color',
 'Mary favorite',
 'Mary favorite flowers',
 'blue roses',
 'blue tulips',
 'color Mary',
 'color Mary favorite',
 'favorite flowers',
 'flowers roses',
 'flowers roses favorite',
 'impossible blue',
 'impossible blue roses',
 'like blue',
 'like blue tulips',
 'like like',
 'like like blue',
 'like red',
 'like red flowers',
 'loves color',
 'loves color Mary',
 'quite unusual',
 'quite unusual red',
 'red flowers',
 'red flowers roses',
 'red tulips',
 'roses favorite',
 'unusual red',
 'unusual red tulips']

更新(根据评论主题):

要将所需的停止词组与自定义停用词一起传递给包装函数,请使用:

desired_stop_phrases = ["red(\s?\\.?\s?)roses"]
desired_stop_words = ['Could', 'buy']

def wrapper(stop_words, stop_phrases):

    def remove_stop_phrases(doc):
        for phrase in stop_phrases:
            doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
        return doc

    sw = text.ENGLISH_STOP_WORDS.union(stop_words)
    mod_vectorizer = text.TfidfVectorizer(
        ngram_range=(2,3),
        stop_words=sw,
        norm='l2',
        min_df=1,
        preprocessor=remove_stop_phrases
    )

    dtm = mod_vectorizer.fit_transform(cases).toarray()
    vocab = np.array(mod_vectorizer.get_feature_names())

    return vocab

wrapper(desired_stop_words, desired_stop_phrases)

答案 1 :(得分:2)

您可以通过传递关键字参数TfidfVectorizer (doc-src)

来切换tokenizer的标记器

原作如下:

def build_tokenizer(self):
    """Return a function that splits a string into a sequence of tokens"""
    if self.tokenizer is not None:
        return self.tokenizer
    token_pattern = re.compile(self.token_pattern)
    return lambda doc: token_pattern.findall(doc)

所以让我们创建一个删除你不想要的所有单词组合的函数。首先让我们定义你不想要的表达式:

unwanted_expressions = [('red','roses'), ('foo', 'bar')]

并且该函数需要看起来像这样:

token_pattern_str = r"(?u)\b\w\w+\b"
def my_tokenizer(doc):
    """split a string into a sequence of tokens
    and remove some words along the way."""

    token_pattern = re.compile(token_pattern_str)
    tokens = token_pattern.findall(doc)
    for i in range(len(tokens)):
        for expr in unwanted_expressions:
            found = True
            for j, word in enumerate(expr):
                found = found and (tokens[i+j] == word)
            if found:
                tokens[i:i+len(expr)] = len(expr) * [None]
    tokens = [x for x in tokens if x is not None]
    return tokens

我自己没有特别尝试过这个,但我之前已经关掉了这个标记器。效果很好。

祝你好运:)

答案 2 :(得分:0)

在将df传递给mod_vectorizer之前,你应该使用类似下一个的东西:

df=["I like red roses as much as I like blue tulips.",
"It would be quite unusual to see red tulips, but not RED ROSES",
"It is almost impossible to find blue roses",
"I like most red flowers, but roses are my favorite.",
"Could you buy me some red roses?",
"John loves the color red. Roses are Mary's favorite flowers."]

df=[ i.lower() for i in df]
df=[i if 'red roses' not in i else i.replace('red roses','') for i in df]

如果您要检查的不仅仅是“红玫瑰”,请用以下内容替换上面的最后一行:

stop_phrases=['red roses']
def filterPhrase(data,stop_phrases):
 for i in range(len(data)):
     for x in stop_phrases:
         if x in data[i]:
             data[i]=data[i].replace(x,'')
 return data
df=filterPhrase(df, stop_phrases)

答案 3 :(得分:-2)

对于Pandas,您希望使用列表压缩

.apply(lambda x: [item for item in x if item not in stop])