Question

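I am running LDA on a large number of texts. When I visualized the resulting topics, I found that the bigram "machine_learning" had been lemmatized both as "machine_learning" and as "machine_learne". Here is the smallest reproducible example I can provide: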

import en_core_web_sm

tokenized = [
    [
        'artificially_intelligent', 'funds', 'generating', 'excess', 'returns',
        'artificial_intelligence', 'deep_learning', 'compelling', 'reasons',
        'join_us', 'artificially_intelligent', 'fund', 'develop', 'ai',
        'machine_learning', 'capabilities', 'real', 'cases', 'big', 'players',
        'industry', 'discover', 'emerging', 'trends', 'latest_developments',
        'ai', 'machine_learning', 'industry', 'players', 'trading',
        'investing', 'live', 'investment', 'models', 'learn', 'develop',
        'compelling', 'business', 'case', 'clients', 'ceos', 'adopt', 'ai',
        'machine_learning', 'investment', 'approaches', 'rare', 'gathering',
        'talents', 'including', 'quants', 'data_scientists', 'researchers',
        'ai', 'machine_learning', 'experts', 'investment_officers', 'explore',
        'solutions', 'challenges', 'potential', 'risks', 'pitfalls',
        'adopting', 'ai', 'machine_learning'
    ],
    [
        'recent_years', 'topics', 'data_science', 'artificial_intelligence',
        'machine_learning', 'big_data', 'become_increasingly', 'popular',
        'growth', 'fueled', 'collection', 'availability', 'data',
        'continually', 'increasing', 'processing', 'power', 'storage', 'open',
        'source', 'movement', 'making', 'tools', 'widely', 'available',
        'result', 'already', 'witnessed', 'profound', 'changes', 'work',
        'rest', 'play', 'trend', 'increase', 'world', 'finance', 'impacted',
        'investment', 'managers', 'particular', 'join_us', 'explore',
        'data_science', 'means', 'finance_professionals'
    ]
]

nlp = en_core_web_sm.load(disable=['parser', 'ner'])

def lemmatization(descrips, allowed_postags=None):
    if allowed_postags is None:
        allowed_postags = ['NOUN', 'ADJ', 'VERB',
                           'ADV']
    lemmatized_descrips = []
    for descrip in descrips:
        doc = nlp(" ".join(descrip))
        lemmatized_descrips.append([
            token.lemma_ for token in doc if token.pos_ in allowed_postags
        ])
    return lemmatized_descrips

lemmatized = lemmatization(tokenized)

print(lemmatized)

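You will notice that "machine_learne" appears nowhere in the input tokenized, yet the output lemmatized contains both "machine_learning" and "machine_learne".

What causes this, and can I expect it to cause problems with other bigrams/trigrams?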

Answer 1

I think you have misunderstood how POS tagging and lemmatization work.

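POS tagging is not based on the word alone (I don't know what your native language is, but this holds for many languages). It also draws on several other pieces of information, including the surrounding words. For example, one commonly learned rule is that in many statements a verb is preceded by a noun that represents the verb's subject.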

When you pass all of these "tokens" to the lemmatizer, spaCy's lemmatizer tries to "guess" the part of speech of each solitary word.

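In many cases it defaults to noun, and if the word is not in the lookup table of common and irregular nouns, it falls back to generic rules (such as stripping a plural "s").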

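In other cases it defaults to verb based on some pattern (an "-ing" ending), which is most likely what happened here. Since no dictionary contains a verb "machine_learning" (the model has no instances of it), it takes the "else" route and applies generic rules.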

So machine_learning is probably being lemmatized by the generic '"ing" to "e"' rule (as in making -> make, baking -> bake), which applies to many regular verbs.
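The lookup-then-fallback behaviour described above can be illustrated with a toy lemmatizer. This is only a minimal sketch: the function `toy_lemmatize` and every table in it are hypothetical and are not spaCy's actual code or data. It shows how an "ing" to "e" rule can yield "machine_learne" when a word is tagged as a verb, while the same word tagged as a noun passes through unchanged.

```python
# Toy rule-based lemmatizer. All tables below are hypothetical and tiny;
# they are NOT spaCy's real data, only an illustration of the mechanism.
KNOWN_VERBS = {"make", "bake", "learn", "generate"}
IRREGULAR_VERBS = {"went": "go", "was": "be"}
VERB_SUFFIX_RULES = [("ing", "e"), ("ing", ""), ("ed", "e"), ("ed", ""), ("s", "")]
NOUN_SUFFIX_RULES = [("ies", "y"), ("s", "")]


def toy_lemmatize(word, pos):
    """Return a lemma for `word`, given the part of speech the tagger chose."""
    if pos == "VERB":
        if word in IRREGULAR_VERBS:
            return IRREGULAR_VERBS[word]
        rules, known = VERB_SUFFIX_RULES, KNOWN_VERBS
    elif pos == "NOUN":
        rules, known = NOUN_SUFFIX_RULES, set()
    else:
        return word

    # Apply every suffix rule that matches and collect the candidate lemmas.
    candidates = [word[:-len(old)] + new for old, new in rules if word.endswith(old)]
    # Prefer a candidate that is a known word ...
    for candidate in candidates:
        if candidate in known:
            return candidate
    # ... otherwise fall back to the first rule that matched (the "else" route).
    return candidates[0] if candidates else word


for word, pos in [("making", "VERB"), ("learning", "VERB"),
                  ("machine_learning", "VERB"), ("machine_learning", "NOUN")]:
    print(word, pos, "->", toy_lemmatize(word, pos))
```

In this sketch "learning" resolves to "learn" because that candidate is in the known-verb table, whereas "machine_learning" has no known candidate, so the first matching rule wins.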

Look at this test example:

for descrip in tokenized:
    doc = nlp(" ".join(descrip))
    print([
        (token.pos_, token.text) for token in doc
    ])

Output:

[('NOUN', 'artificially_intelligent'), ('NOUN', 'funds'), ('VERB', 'generating'), ('ADJ', 'excess'), ('NOUN', 'returns'),
 ('NOUN', 'artificial_intelligence'), ('NOUN', 'deep_learning'), ('ADJ', 'compelling'), ('NOUN', 'reasons'),
 ('PROPN', 'join_us'), ('NOUN', 'artificially_intelligent'), ('NOUN', 'fund'), ('NOUN', 'develop'), ('VERB', 'ai'),
 ('VERB', 'machine_learning'), ('NOUN', 'capabilities'), ('ADJ', 'real'), ('NOUN', 'cases'), ('ADJ', 'big'),
 ('NOUN', 'players'), ('NOUN', 'industry'), ('VERB', 'discover'), ('VERB', 'emerging'), ('NOUN', 'trends'),
 ('NOUN', 'latest_developments'), ('VERB', 'ai'), ('VERB', 'machine_learning'), ('NOUN', 'industry'),
 ('NOUN', 'players'), ('NOUN', 'trading'), ('VERB', 'investing'), ('ADJ', 'live'), ('NOUN', 'investment'),
 ('NOUN', 'models'), ('VERB', 'learn'), ('VERB', 'develop'), ('ADJ', 'compelling'), ('NOUN', 'business'),
 ('NOUN', 'case'), ('NOUN', 'clients'), ('NOUN', 'ceos'), ('VERB', 'adopt'), ('VERB', 'ai'),
 ('ADJ', 'machine_learning'), ('NOUN', 'investment'), ('NOUN', 'approaches'), ('ADJ', 'rare'), ('VERB', 'gathering'),
 ('NOUN', 'talents'), ('VERB', 'including'), ('NOUN', 'quants'), ('NOUN', 'data_scientists'), ('NOUN', 'researchers'),
 ('VERB', 'ai'), ('ADJ', 'machine_learning'), ('NOUN', 'experts'), ('NOUN', 'investment_officers'), ('VERB', 'explore'),
 ('NOUN', 'solutions'), ('VERB', 'challenges'), ('ADJ', 'potential'), ('NOUN', 'risks'), ('NOUN', 'pitfalls'),
 ('VERB', 'adopting'), ('VERB', 'ai'), ('NOUN', 'machine_learning')]

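Depending on the context, machine_learning is tagged as both a verb and a noun. Note, however, that simply joining the words together gives the tagger a mess to work with, because they are not ordered the way natural language would be.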

Not even a human could understand and correctly POS-tag this text:

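artificially_intelligent funds generating excess returns artificial_intelligence deep_learning compelling reasons join_us artificially_intelligent fund develop ai machine_learning capabilities real cases big players industry discover emerging trends latest_developments ai machine_learning industry players trading investing live investment models learn develop compelling business case clients ceos adopt ai machine_learning investment approaches rare gathering talents including quants data_scientists researchers ai machine_learning experts investment_officers explore solutions challenges potential risks pitfalls adopting ai machine_learning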
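Since the tagger cannot be relied on for text like this, one possible workaround is to leave underscore-joined tokens untouched during lemmatization. The sketch below is an assumption-laden variant of the question's function, not something the answer prescribes: the name `lemmatize_keep_ngrams` is hypothetical, the pipeline is passed in as the `nlp` argument, and an underscore in the token text is assumed to mark a bigram or trigram built in an earlier step.

```python
def lemmatize_keep_ngrams(descrips, nlp,
                          allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    """Lemmatize like the question's function, but keep joined n-grams as they are.

    `nlp` is any callable returning tokens with .text, .lemma_ and .pos_
    (for example the spaCy pipeline loaded in the question).
    """
    lemmatized = []
    for descrip in descrips:
        doc = nlp(" ".join(descrip))
        lemmatized.append([
            # Assumption: an underscore marks a bigram/trigram built earlier,
            # so its surface form is kept instead of the guessed lemma.
            token.text if "_" in token.text else token.lemma_
            for token in doc
            if token.pos_ in allowed_postags
        ])
    return lemmatized
```

With the objects from the question this would be called as lemmatize_keep_ngrams(tokenized, nlp). The POS filter still applies, so an n-gram tagged outside the allowed list (such as 'join_us' tagged PROPN in the output above) is still dropped.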
