Question

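I am using SpaCy to tokenize thousands of documents. On average it takes about 5 seconds per document. Any suggestions on how to speed up the tokenizer?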

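Some additional information:

The input files are text files with newline characters
The average file size is about 400KB
The tokens of each input file are written to a new line in the output file (although I can change this if it helps increase speed)
There are 1655 stopwords
The output file will be fed to fasttext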

The following is my code:

from pathlib import Path, PurePath
from time import time

import en_core_web_sm

st = time()
nlp = en_core_web_sm.load(disable = ['ner', 'tagger', 'parser', 'textcat'])
p = Path('input_text/').glob('*.txt')
files = ['input_text/' + x.name for x in p if x.is_file()]

#nlp = spacy.load('en-core-web-sm')

stopwords_file = 'stopwords.txt'

def getStopWords():
    f = open(stopwords_file, 'r')
    stopWordsSet = f.read()
    return stopWordsSet

stopWordsSet = getStopWords()
out_file = 'token_results.txt'
for file in files:
    #print (out_file)
    with open(file, encoding="utf8") as f:
        st_doc = time()
        for line in f:

            doc = nlp(line)

            for token in doc:
                if (not token.text.lower() in stopWordsSet
                    and not token.is_punct and not token.is_space and not token.like_num
                    and len(token.shape_)>1):                    

                    tup = (token.text, '|', token.lemma_)

                    appendFile = open(out_file, 'a', encoding="utf-8")
                    appendFile.write(" " + tup[0])
        print((time() -st_doc), 'seconds elasped for', file)
        appendFile.write('\n')
        appendFile.close()
print((time()-st)/60, 'minutes elasped')

Answer 1

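The main problem: open your output file once and keep it open until the end of your script. Repeatedly closing and reopening it, and seeking to the end of an ever larger text file, is going to be extremely slow.
Read the stopwords into an actual set(). Otherwise you are searching for each token in one long string containing the whole file, which accidentally matches partial words and is much slower than checking for set membership.
Use nlp.pipe(), or for tokenization alone just nlp.tokenizer.pipe(), to speed up the spaCy part a bit. With a bunch of short one-sentence documents this doesn't seem to make a huge difference. It is much faster to tokenize one large document than to treat each line as an individual document, but whether you want to do that depends on how your data is structured. If you are just tokenizing, you can increase the maximum document size (nlp.max_length) if you need to.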

texts = f.readlines()
docs = nlp.tokenizer.pipe(texts)

for doc in docs:
    for token in doc:
        ...
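
Putting the three points together, the loop from the question might be restructured roughly as below. This is only a sketch, not a drop-in replacement: the helper names (load_stopwords, keep_token, tokenize_files) are made up for illustration, the stopword file is assumed to hold one word per line, and the output file is opened in write mode rather than append mode.

```python
def load_stopwords(path):
    """Read the stopwords into a set (assumes one word per line)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def keep_token(token, stopwords):
    """Same filter as in the question, but tested against a real set."""
    return (
        token.text.lower() not in stopwords
        and not token.is_punct
        and not token.is_space
        and not token.like_num
        and len(token.shape_) > 1
    )


def tokenize_files(nlp, files, out_path, stopwords):
    """Write the kept tokens of each input file as one line of out_path."""
    # Open the output once instead of reopening it for every token.
    with open(out_path, "w", encoding="utf-8") as out:
        for file in files:
            with open(file, encoding="utf-8") as f:
                # Tokenizer only: each line of the file becomes one Doc.
                docs = nlp.tokenizer.pipe(f)
                kept = [t.text for doc in docs for t in doc
                        if keep_token(t, stopwords)]
            out.write(" ".join(kept) + "\n")


# Possible call, assuming spaCy is installed and the paths from the question:
#   import spacy
#   from pathlib import Path
#   nlp = spacy.blank("en")  # a blank pipeline is enough for tokenizing
#   files = sorted(Path("input_text").glob("*.txt"))
#   tokenize_files(nlp, files, "token_results.txt",
#                  load_stopwords("stopwords.txt"))
```

With a set, a token like "he" is no longer dropped just because it happens to be a substring of "the" somewhere in the stopword file, so the output can differ slightly from what the original script produced.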

Speeding up the SpaCy tokenizer
