Question

I wrote the following function to clean the text notes in my dataset:

import spacy
nlp = spacy.load("en")
def clean(text):
    """
    Text preprocessing for english text
    """
    # Apply spacy to the text
    doc=nlp(text)
    # Lemmatization and removal of noise (stop words, digits, punctuation and single characters)
    tokens=[token.lemma_.strip() for token in doc if 
            not token.is_stop and not nlp.vocab[token.lemma_].is_stop # Remove stop words
            and not token.is_punct # Remove punctuation
            and not token.is_digit # Remove digits
           ]
    # Recreation of the text
    text=" ".join(tokens)

    return text.lower()

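The problem is that cleaning the whole dataset takes hours. (My dataset has 70,000 rows, with 100 to 5,000 words per row.)

I tried using swifter to run the apply method on multiple threads, for example: data.note_line_comment.swifter.apply(clean)

But it didn't really help, since it still took almost an hour.

I would like to know whether there is any way to vectorize the function, or some other way to speed up the processing. Any ideas?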

Answer 1

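Short answer

This type of problem inherently takes time.

Long answer

Use regular expressions
Change the spaCy pipeline

The more information about the string you need in order to make a decision, the longer it takes.

The good news is that if your text cleaning is relatively simple, a few regular expressions might do the trick.

Otherwise you are using the spaCy pipeline to help remove parts of the text, which is expensive because by default it does a lot of things: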
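As a rough illustration of the regular-expression route, here is a minimal sketch that lowercases the text and drops punctuation, digits, and stop words. It makes no attempt at lemmatization, so it is not equivalent to the clean function in the question. The names STOP_WORDS, TOKEN_RE and clean_regex are made up for this example, and the stop-word list is a tiny placeholder rather than spaCy's list.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "is", "are", "of", "to", "in", "on"})

# Runs of ASCII letters, optionally with one internal apostrophe.
# Digits and punctuation never match, so they are dropped implicitly.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def clean_regex(text):
    """Lowercase the text and drop punctuation, digits and stop words (no lemmatization)."""
    tokens = TOKEN_RE.findall(text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(clean_regex("The 3 cats are sitting on the mat, in 2020!"))
```

Assuming the pandas column from the question, this could be applied with something like data.note_line_comment.map(clean_regex). Whether the loss of lemmatization is acceptable depends on what the cleaned text is used for.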

Tokenization
Lemmatization
Dependency parsing
NER
Chunking

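Alternatively, you can try your task again with the parts of the spaCy pipeline you don't need switched off, which may speed it up quite a bit.

For example, maybe turn off named entity recognition, tagging and dependency parsing...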

nlp = spacy.load("en", disable=["parser", "tagger", "ner"])

Then try again; it should run faster.
