Question

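I have a DataFrame with more than 90,000 rows and a column ['text'] that contains news articles.

The texts average about 3,000 words each, and running them through word_tokenize is very slow. What would be a more efficient way to do this?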

from nltk.tokenize import word_tokenize
df['tokenized_text'] = df.iloc[0:10]['texto'].apply(word_tokenize) 
df.head()

Also, word_tokenize does not remove some punctuation and other characters that I don't want, so I wrote a function to filter them out, in which I use spaCy.

from spacy.lang.es.stop_words import STOP_WORDS
from nltk.corpus import stopwords
spanish_stopwords = set(stopwords.words('spanish'))
otherCharacters = ['`','�',' ','\xa0']
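def tokenize(phrase):
    sentence_tokens = []
    tokenized_phrase = nlp(phrase)
    for token in tokenized_phrase:
        text = token.text.lower()  # lowercase once and reuse it
        # Keep a token only if it is not punctuation, not a stop word
        # (spaCy or NLTK) and not one of the unwanted characters.
        # `~` on a Python bool is bitwise NOT (always truthy), so use `not`.
        if not (token.is_punct or token.is_stop or text in spanish_stopwords
                or text in otherCharacters or text in STOP_WORDS):
            sentence_tokens.append(text)
    return sentence_tokens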

Is there a better way to do this?

Thanks for reading my possibly noob question, and have a nice day.

Thanks.

nlp was defined earlier:

import spacy
import es_core_news_sm
nlp = es_core_news_sm.load()

I use spaCy for tokenization, but I also use the NLTK stop words for Spanish.

Answer 1

If you are only tokenizing, use a blank model (which contains just the tokenizer) instead of es_core_news_sm:

nlp = spacy.blank("es")
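As a minimal sketch of how the blank pipeline could be applied to many texts at once, the helper below feeds them through `nlp.pipe`, spaCy's batch-processing method, instead of calling `nlp` once per row. The function names `load_tokenizer_only` and `tokenize_texts` and the batch size of 256 are assumptions made for illustration; they are not part of the question or of spaCy.

```python
def load_tokenizer_only(lang="es"):
    """Return a blank spaCy pipeline: tokenizer only, no tagger, parser or NER."""
    import spacy  # imported lazily so tokenize_texts works with any pipeline object
    return spacy.blank(lang)


def tokenize_texts(texts, nlp, batch_size=256):
    """Tokenize an iterable of strings in batches.

    Returns one list of lowercased token strings per input text.
    The batch size is an arbitrary starting point, not a tuned value.
    """
    return [
        [token.text.lower() for token in doc]
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]
```

With the DataFrame from the question, this might be used as `nlp = load_tokenizer_only()` followed by `df['tokenized_text'] = tokenize_texts(df['texto'], nlp)`.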

Answer 2

To make spaCy faster when you only want to tokenize, you can change:

nlp = es_core_news_sm.load()

to:

nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])

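A brief explanation:
spaCy provides a full language pipeline that not only tokenizes your sentences but also parses them and does POS and NER tagging. In practice, most of the computation time goes to those other tasks (parse tree, POS, NER) rather than to tokenization, which is computationally a much "lighter" task.
But, as you can see, spaCy lets you use only what you actually need, which saves you some time.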
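One way to check how much time the extra components cost on your own data is to time each pipeline on the same small sample. The helper below is only a sketch using the standard library; `time_tokenizer` is a hypothetical name, and `process` stands for any callable that takes one text, such as a loaded `nlp` object.

```python
import time


def time_tokenizer(process, texts):
    """Return the seconds `process` takes to run over every text in `texts`."""
    start = time.perf_counter()
    for text in texts:
        process(text)
    return time.perf_counter() - start
```

Calling it once with the full pipeline and once with the reduced one, on a few hundred rows, should show whether the disabled components account for most of the time.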

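Another thing: you can make your function more efficient by lowercasing each token only once and by adding the stop words to spaCy (even if you don't want to do that, the fact that otherCharacters is a list rather than a set is not very efficient).

I would also add this: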

for w in stopwords.words('spanish'):
    nlp.vocab[w].is_stop = True
for w in otherCharacters:
    nlp.vocab[w].is_stop = True
for w in STOP_WORDS:
    nlp.vocab[w].is_stop = True

and then:

for token in tokenized_phrase:
    if not token.is_punct and not token.is_stop:
        sentence_tokens.append(token.text.lower())
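The two efficiency points in this answer (lowercase once, and test membership against a set rather than a list) can also be illustrated without spaCy. The sketch below works on plain strings; `build_exclusion_set` and `filter_tokens` are hypothetical helper names, and the sample words in the usage note are made up for illustration.

```python
def build_exclusion_set(*collections):
    """Merge stop-word lists and unwanted characters into one lowercase set."""
    return frozenset(item.lower() for coll in collections for item in coll)


def filter_tokens(words, excluded):
    """Return the lowercased words that are not in `excluded`."""
    kept = []
    for word in words:
        lowered = word.lower()       # lowercase once, reuse below
        if lowered not in excluded:  # set lookup: constant time on average
            kept.append(lowered)
    return kept
```

For instance, `filter_tokens(["La", "casa", "de", "Pedro"], build_exclusion_set(["de", "la"]))` keeps only "casa" and "pedro". A list lookup scans every element, so with a few hundred stop words and millions of tokens the set version does far less work.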

Tokenizing texts - very slow when executing
