Question

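I am building a text classifier in Python, and I have a list of key phrases for each class. For example, the classes could be "travel" and "science", and the lists could contain: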

travel: "New York", "South Korea", "Seoul", etc.
science: "scientist", "chemical", etc.

I am looking for the best way to match phrases from such lists in Python.

For example, the result for the document:

A famous scientist traveled from New York to Seoul, South Korea

should be: "science": 1, "travel": 3

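Even though the "in" operator for strings is well optimized, there are a few cases that still need handling:

Word boundaries: at some point I could have "to" in the dictionary, and I would not want it to match the "to" in "tomorrow". Tokenization would work in this case, but phrases may then need some custom logic, possibly a sublist lookup within the token list.
Stemming: "scientists discover" should also match when "scientist discover" is in the list.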

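Is there a Python library that handles this efficiently? If I need to implement it from scratch, what is the best way to handle the issues above in terms of performance?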

Answer 1

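What you are trying to implement is phrase search over stems. I believe this is a text-mining task of the kind that search engines implement.

First, you need a tokenize function and a stemmer function. Tokenize can be as simple as this: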

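import string

def tokenize(text):
    # Remove ASCII punctuation, then split on whitespace; str.split() never yields empty tokens.
    return text.translate(str.maketrans("", "", string.punctuation)).split()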

There are various stemmers available on PyPI.

You end up with a function like this:
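If you just want something to experiment with before choosing a real library, a toy stemmer could look roughly like the sketch below. This is an assumption for illustration only: the suffix list and the minimum stem length are arbitrary choices, and it is much cruder than a proper stemming algorithm.

```python
def stemmer(word):
    """Toy stemmer: lowercase the word, then strip one common English suffix.

    Hypothetical stand-in for a real stemmer from PyPI.
    """
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of stem so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(stemmer("scientists"), stemmer("discovered"))
```

With this, "scientists" and "scientist" reduce to the same stem, which is what the phrase matching relies on.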

def preprocess(text):
    # stemmer is whichever stemming function you picked (one word in, one stem out).
    return [stemmer(word) for word in tokenize(text)]

Then the function you are looking for looks like this:

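from collections import Counter


def count(dictionary, document):
    """Count, per topic, how many times its key phrases occur in the document."""
    counter = Counter()
    tokens = preprocess(document)
    for topic, phrases in dictionary.items():
        for phrase in phrases:
            stems = preprocess(phrase)
            if not stems:
                continue
            # Candidate start positions: every token equal to the first stem.
            indices = [i for i, token in enumerate(tokens) if token == stems[0]]
            for index in indices:
                # The phrase matches if the tokens from here on equal all of its stems.
                if tokens[index:index + len(stems)] == stems:
                    counter[topic] += 1
    return counter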

The dictionary variable contains the following information:

travel: "New York", "South Korea", "Seoul", etc.
science: "scientist", "chemical", etc.
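Since the question also asks about performance, one possible variation is to index the phrases once and then look up every n-gram of the document in that index, instead of scanning the document once per phrase. The sketch below is only an illustration under a few assumptions: the names tokens_of, build_index and count_topics are made up here, and normalize defaults to str.lower, so you would pass your own stemming function to get stem matching.

```python
import string
from collections import Counter

_PUNCT = str.maketrans("", "", string.punctuation)


def tokens_of(text, normalize=str.lower):
    """Strip ASCII punctuation, split on whitespace, normalize each token."""
    return tuple(normalize(tok) for tok in text.translate(_PUNCT).split())


def build_index(dictionary, normalize=str.lower):
    """Map each normalized phrase (a tuple of tokens) to the topics listing it."""
    index = {}
    for topic, phrases in dictionary.items():
        for phrase in phrases:
            key = tokens_of(phrase, normalize)
            if key:
                index.setdefault(key, []).append(topic)
    return index


def count_topics(index, document, normalize=str.lower):
    """Count phrase hits per topic by looking up every n-gram of the document."""
    tokens = tokens_of(document, normalize)
    lengths = {len(key) for key in index}  # only n-gram sizes that can match
    counter = Counter()
    for n in lengths:
        for start in range(len(tokens) - n + 1):
            for topic in index.get(tokens[start:start + n], ()):
                counter[topic] += 1
    return counter


topics = {
    "travel": ["New York", "South Korea", "Seoul"],
    "science": ["scientist", "chemical"],
}
index = build_index(topics)
result = count_topics(index, "A famous scientist traveled from New York to Seoul, South Korea")
print(dict(result))
```

Because matching works on whole tokens, "to" in the dictionary would not match inside "tomorrow". The cost is roughly the number of document tokens times the number of distinct phrase lengths, independent of how many phrases each topic has.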

Answer 2

In this case, a simple solution is to use a dictionary comprehension:

s = "A famous scientist traveled from New York to Seoul, South Korea"
d = {"travel":["New York", "South Korea", "Seoul"], "science": ["scientist", "chemical"]}
final_results = {a:sum(i in s for i in b) for a, b in d.items()}

Output:

{'science': 1, 'travel': 3}
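Note that the plain "in" test matches substrings, so "to" would also be found inside "tomorrow". If word boundaries matter, a variation using the re module might look like the sketch below. The function name count_with_boundaries is made up for illustration, it assumes every phrase starts and ends with a word character, and unlike the comprehension above it counts every occurrence rather than each phrase at most once.

```python
import re


def count_with_boundaries(text, dictionary):
    """Count phrase occurrences per topic, matching only at word boundaries."""
    results = {}
    for topic, phrases in dictionary.items():
        if not phrases:
            results[topic] = 0
            continue
        # One alternation per topic; longer phrases first so they win over their prefixes.
        ordered = sorted(phrases, key=len, reverse=True)
        pattern = re.compile(
            r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b",
            re.IGNORECASE,
        )
        results[topic] = len(pattern.findall(text))
    return results


s = "A famous scientist traveled from New York to Seoul, South Korea"
d = {"travel": ["New York", "South Korea", "Seoul"], "science": ["scientist", "chemical"]}
print(count_with_boundaries(s, d))
```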

Fast dictionary lookup for phrases and stems in Python
