spaCy: Optimizing tokenization

Date: 2018-10-19 17:41:59

Tags: machine-learning nlp spacy

I'm currently trying to tokenize a text file where each line is the body of a tweet:

"According to data reported to FINRA, short volume percent for $SALT clocked in at 39.19% on 12-29-17 http://www.volumebot.com/?s=SALT"
"@Good2go @krueb The chart I posted definitely supports ng going lower.  Gobstopper' 2.12, might even be conservative."
"@Crypt0Fortune Its not dumping as bad as it used to...."
"$XVG.X LOL. Someone just triggered a cascade of stop-loss orders and scooped up morons' coins. Oldest trick in the stock trader's book."

The file is 59,397 lines long (one day's worth of data), and I'm using spaCy for the preprocessing/tokenization. It currently takes around 8.5 minutes, and I was wondering if there is any way of optimizing the following code to be faster, since 8.5 minutes is too long for this process:

from os import listdir
from os.path import isfile, join
import time
from datetime import timedelta

import spacy

# Assumed: a full English pipeline; the question doesn't show this line.
nlp = spacy.load('en')

def token_loop(path):
    store = []
    files = [f for f in listdir(path) if isfile(join(path, f))]

    start_time = time.monotonic()
    for filename in files:
        # join(path, filename) instead of the hard-coded "./data/" prefix,
        # so the function respects its path argument.
        with open(join(path, filename)) as f:
            for line in f:
                tokens = nlp(line.lower())
                # Keep lemmas of multi-character alphabetic, non-stopword tokens.
                tokens = [token.lemma_ for token in tokens
                          if not token.orth_.isspace() and token.is_alpha
                          and not token.is_stop and len(token.orth_) != 1]
                store.append(tokens)

    end_time = time.monotonic()
    print("Time taken to tokenize:", timedelta(seconds=end_time - start_time))

    return store

Although it iterates over files, it's currently only looping over one file.

Just to note, I only need this to tokenize the contents; I don't need any extra tagging etc.

2 answers:

Answer 0 (score: 0):

It sounds like you haven't optimized the pipeline yet. You'll get a significant speedup from disabling the pipeline components you don't need, like so:

nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])

That should get you down to around the two-minute mark, or better, on its own.

If you need a further speedup, you can look at multithreading with nlp.pipe. The documentation on multithreading is here: https://spacy.io/usage/processing-pipelines#section-multithreading
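For illustration, a minimal sketch of that pattern (the file name and batch_size value are assumptions, not part of the answer; multithreading behavior varies by spaCy version, so check the linked docs for yours):

import spacy

nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])

# Hypothetical input file; each line is one tweet, as in the question.
with open("tweets.txt") as f:
    lines = [line.lower() for line in f]

# nlp.pipe streams Docs in batches instead of one nlp() call per line.
# batch_size=500 is an assumed tuning value, not from the answer.
for doc in nlp.pipe(lines, batch_size=500):
    lemmas = [t.lemma_ for t in doc if t.is_alpha and not t.is_stop]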

Answer 1 (score: 0):

You can use nlp.pipe instead of nlp(line) to speed up the processing.

See spaCy's documentation: https://spacy.io/usage/processing-pipelines
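Putting the two answers together, here is a sketch of the question's token_loop rewritten around the trimmed pipeline and nlp.pipe; the batch size is an assumed value, and everything else is taken from the question's code:

from os import listdir
from os.path import isfile, join
import time
from datetime import timedelta

import spacy

# Trimmed pipeline from answer 0; disabling unused components speeds up calls.
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])

def token_loop(path):
    store = []
    files = [f for f in listdir(path) if isfile(join(path, f))]

    start_time = time.monotonic()
    for filename in files:
        with open(join(path, filename)) as f:
            lines = [line.lower() for line in f]

        # One streamed nlp.pipe pass per file (answer 1) instead of one
        # nlp() call per line; batch_size=500 is an assumed tuning value.
        for doc in nlp.pipe(lines, batch_size=500):
            store.append([token.lemma_ for token in doc
                          if not token.orth_.isspace() and token.is_alpha
                          and not token.is_stop and len(token.orth_) != 1])

    end_time = time.monotonic()
    print("Time taken to tokenize:", timedelta(seconds=end_time - start_time))

    return store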