Question

Consider this minimal dataframe:

import spacy
nlp = spacy.load('en_core_web_sm')
import pandas as pd
import numpy as np
mydata = pd.DataFrame({'text' : [u'the cat eats the dog. the dog eats the cat']})

I know I can use apply to run spacy on my text column:

mydata['parsed'] = mydata.text.apply(lambda x: nlp(x))

However, I would like to do something more subtle:

How can I use part-of-speech tagging and spacy to extract the sentences whose subject is dog?

The output should be a new column, extracted, containing the matching sentence.

Thanks!

Answer 1

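This isn't really a pandas question. You have three problems to solve:

Split each string into multiple sentences
Determine the subject of each sentence
If the subject is dog, return the sentence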

1. We can split a string into a list of sentences using the split() method.

my_string = "the dog ate the bread. the cat ate the bread"
sentences = my_string.split('.')
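
One thing to watch with a plain split('.'): the pieces keep their leading spaces, and a trailing period produces an empty string at the end of the list. A minimal sketch of a cleanup step follows; the helper name split_sentences is hypothetical and not part of the original answer, and splitting on periods is still naive (abbreviations such as "Mr." would be cut in two).

```python
def split_sentences(text):
    """Split on periods, dropping surrounding whitespace and empty pieces."""
    return [piece.strip() for piece in text.split('.') if piece.strip()]


my_string = "the dog ate the bread. the cat ate the bread."

# The raw split keeps the leading space and a trailing empty string
print(my_string.split('.'))

# The cleaned version returns only the non-empty, stripped sentences
print(split_sentences(my_string))
```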

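2. According to the spaCy documentation, calling nlp() on a string gives us a Doc containing tokens, and each token has a number of properties attached to it.

The property we are interested in is dep_, because it tells us how a token relates to the other tokens, i.e. whether or not our token is the subject.

You can find the list of properties here: https://spacy.io/usage/linguistic-features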

doc = nlp(my_string)

for token in doc:
    print(token.dep_)  # if this prints `nsubj`, the token is a nominal subject

3. To check whether the token is equal to 'dog', we need to get the text property from the token:

token.text

Putting it all together:

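import spacy

NLP = spacy.load('en_core_web_sm')

def extract_sentence_based_on_subject(string, subject):
    """Return the first sentence whose nominal subject is `subject`, else None."""
    # Naive split on periods; strip whitespace and skip empty pieces
    sentences = [s.strip() for s in string.split('.') if s.strip()]

    for sentence in sentences:
        doc = NLP(sentence)
        for token in doc:
            # 'nsubj' is the dependency label for a nominal subject
            if token.dep_ == 'nsubj' and token.text == subject:
                return sentence
    return None


# Extra keyword arguments to Series.apply are passed on to the function
mydata['extracted'] = mydata['text'].apply(extract_sentence_based_on_subject, subject='dog')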
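
The function above returns only the first matching sentence and relies on splitting at periods. As a possible variant, spaCy's Doc exposes sentence spans through doc.sents, and each span can be iterated token by token. The sketch below is written against that interface and collects every matching sentence; the name sentences_with_subject and the case-insensitive comparison are assumptions for illustration, not part of the original answer.

```python
def sentences_with_subject(doc, subject):
    """Return the text of every sentence in `doc` whose nominal subject is `subject`.

    `doc` is expected to behave like a spaCy Doc: `doc.sents` yields sentence
    spans, each span has a `.text` attribute and yields tokens that carry
    `.dep_` and `.text`.
    """
    wanted = subject.lower()
    matches = []
    for sent in doc.sents:
        # Keep the sentence if any of its tokens is an 'nsubj' equal to the subject
        if any(tok.dep_ == 'nsubj' and tok.text.lower() == wanted for tok in sent):
            matches.append(sent.text.strip())
    return matches
```

With the data from the question, this could be applied as mydata['text'].apply(lambda t: sentences_with_subject(NLP(t), 'dog')), assuming the loaded pipeline sets sentence boundaries (the parser in en_core_web_sm does). Each cell then holds a list of sentences rather than a single string.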
