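A more efficient implementation of Textacy / spaCy 'subject_verb_object_triples'

Time: 2018-12-27 13:11:43

Tags: python pandas nlp spacy textacy

I am trying to apply the 'extract.subject_verb_object_triples' function to the text in my dataset. However, the code I wrote is very slow and uses a lot of memory. Is there a more efficient way to implement this?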

Sample data (sp500news)

import spacy
import textacy

def extract_SVO(text):

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    tuples = textacy.extract.subject_verb_object_triples(doc)
    tuples_to_list = list(tuples)
    if tuples_to_list != []:
        tuples_list.append(tuples_to_list)

tuples_list = []          
sp500news['title'].apply(extract_SVO)
print(tuples_list)

1 answer:

Answer 0: (score: 1)

This should speed things up:

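import spacy
import textacy

# Load the model once, outside the function, so it is not reloaded for every row.
nlp = spacy.load('en_core_web_sm')

def extract_SVO(doc):
    # subject_verb_object_triples returns a generator, which is always truthy,
    # so build the list first and test the list for emptiness.
    tuples_to_list = list(textacy.extract.subject_verb_object_triples(doc))
    if tuples_to_list:
        tuples_list.append(tuples_to_list)

tuples_list = []
# Parse each title into a spaCy Doc (this overwrites the column with Doc objects).
sp500news['title'] = sp500news['title'].apply(nlp)
_ = sp500news['title'].apply(extract_SVO)
print(tuples_list)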

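Explanation

In the OP's implementation, nlp = spacy.load('en_core_web_sm') is called inside the function, so the model is loaded again on every call. I think this is the biggest bottleneck. Moving the call out of the function should speed things up.

In addition, a result is only appended to the list when it is non-empty. Note that subject_verb_object_triples returns a generator, which is always truthy, so the emptiness check has to be made on the list built from it, not on the generator itself.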
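
As a rough sketch of the points above, the extraction can also be written so that it returns its results instead of appending to a global list. This version goes one step beyond the answer and streams the texts through nlp.pipe, which processes them in batches; whether that helps noticeably on this dataset is an assumption, not something measured here. The names collect_triples and run_on_titles are hypothetical helpers introduced only for illustration.

```python
from typing import Callable, Iterable, List


def collect_triples(texts: Iterable[str], nlp, extract: Callable) -> List[list]:
    """Return the non-empty lists of triples found in `texts`.

    `nlp` is an already-loaded spaCy pipeline; `extract` is a function such as
    textacy.extract.subject_verb_object_triples that yields triples from a Doc.
    """
    results = []
    # nlp.pipe streams the texts through the pipeline in batches,
    # instead of making one separate call per row.
    for doc in nlp.pipe(texts):
        triples = list(extract(doc))  # materialize the generator before testing it
        if triples:
            results.append(triples)
    return results


def run_on_titles(titles: Iterable[str]) -> List[list]:
    """Load the model once, then extract triples from every title."""
    # Imported here so that collect_triples can be tried without spaCy installed.
    import spacy
    import textacy.extract

    nlp = spacy.load("en_core_web_sm")
    return collect_triples(titles, nlp, textacy.extract.subject_verb_object_triples)


# Intended use with the DataFrame from the question:
# tuples_list = run_on_titles(sp500news["title"])
```

Passing the pipeline and the extractor in as arguments keeps the model load in one place and makes the loop easy to check with stand-in objects.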