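Capturing organization names from a DataFrame

Asked: 2019-05-02 16:28:41

Tags: python python-3.x spacy named-entity-recognition

I have a bunch of rows, each containing sentence text. I'm trying to apply spaCy entity extraction to get the organizations and locations.

I can pass a single string and get its entities. But when I apply that to a DataFrame it fails with the error shown in the code. I'm not sure whether I wrote the for loop incorrectly or whether I'm calling (X.text, X.label_) the wrong way. Is there a way to apply spaCy to DataFrame rows?

The DataFrame version, which does not work: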

import numpy as np
import pandas as pd
import spacy
from spacy import displacy
import en_core_web_sm

nlp = en_core_web_sm.load()  # load the small English model once

id1 = [1,2,3]
text = ['University of California has great research located in San Diego',np.NaN,'MIT is at Boston']
df = pd.DataFrame({'id':id1,'text':text})
df['text'] = df['text'].astype(str)
print(df)
'''
   id                                                              text
0   1  University of California has great research located in San Diego
1   2                                                               nan
2   3                                                  MIT is at Boston
'''
# works: passing nlp function from spacy 
df['text'] = df['text'].apply(lambda x: nlp(x)) # tokenized it
print(df['text'])

# fails
for row in df.iterrows():
    # getting: AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'label_'
    test = [(X.text, X.label_) for X in df['text']]
print(test)

The plain-string version, which works:

sentence = 'University of California has great research located in San Diego'
result = nlp(sentence)
print([(X.text, X.label_) for X in result.ents])
'''
[('University of California', 'ORG'), ('San Diego', 'GPE')]
'''

How can I get a result like this?

   id                                                              text                                                 spacy_results         
0   1  University of California has great research located in San Diego [('University of California', 'ORG'), ('San Diego', 'GPE')]
1   2                                                               nan nan
2   3                                                  MIT is at Boston                         [('MIT', 'ORG'), ('Boston', 'GPE')]

2 Answers:

Answer 0: (score: 0)

text = [[1, 'University of California has great research located in San Diego'],[2, 'MIT is at Boston']]
df = pd.DataFrame(text, columns = ['id', 'text'])
# doc.ents holds Span objects; pull (text, label_) pairs out of each one
df['new_text'] = df['text'].apply(lambda x: [(ent.text, ent.label_) for ent in nlp(x).ents])
print(df['new_text'])
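The AttributeError in the question comes from iterating over the column itself: each cell holds a Doc, and label_ lives on the Span objects in doc.ents, not on the Doc. The question also wanted the missing row to stay as nan, which the snippet above sidesteps by dropping that row. Here is a minimal sketch of one way to keep it. It assumes spaCy v3, and to stay runnable without a model download it uses a blank pipeline with hand-written entity_ruler patterns as a stand-in for en_core_web_sm; extract_entities is a hypothetical helper name, not part of spaCy.

```python
import numpy as np
import pandas as pd
import spacy

# Assumption: spaCy v3. The blank pipeline plus rule patterns below is only a
# stand-in for en_core_web_sm so the sketch is deterministic and self-contained.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "University of California"},
    {"label": "ORG", "pattern": "MIT"},
    {"label": "GPE", "pattern": "San Diego"},
    {"label": "GPE", "pattern": "Boston"},
])

def extract_entities(text):
    """Hypothetical helper: (text, label) pairs for a string, NaN for a missing value."""
    if pd.isna(text):
        return np.nan
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": [
        "University of California has great research located in San Diego",
        np.nan,
        "MIT is at Boston",
    ],
})

# Leave the raw text column untouched and write the entities to a new column.
df["spacy_results"] = df["text"].apply(extract_entities)
print(df)
```

Note that the column is not converted with astype(str) first. Doing so would turn the missing value into the literal string 'nan', which would then be passed to the pipeline like any other text.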

Answer 1: (score: 0)

Here is the code:


text = [[1, 'University of California has great research located in San Diego'],[2, 'MIT is at Boston']]
df = pd.DataFrame(text, columns = ['id', 'text'])

def spacy_entity(text):
    # run the pipeline on one string and collect [text, label] pairs
    doc = nlp(text)
    return [[ent.text, ent.label_] for ent in doc.ents]

df['new_text'] = df['text'].apply(spacy_entity)

print(df['new_text'])
0    [[University of California, ORG], [San Diego, ...
1                          [[MIT, ORG], [Boston, GPE]]
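If only the organization names are needed, as in the title, the entities can be filtered by label. For larger frames, nlp.pipe processes the texts as a batch instead of calling nlp once per row. The following is a rough sketch under the same assumptions as before: spaCy v3, with a blank pipeline and entity_ruler patterns standing in for en_core_web_sm so it runs as-is. The column names orgs and places are made up for illustration.

```python
import pandas as pd
import spacy

# Assumption: spaCy v3. Rule patterns stand in for a trained model here;
# with en_core_web_sm loaded as nlp, the rest of the code stays the same.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "University of California"},
    {"label": "ORG", "pattern": "MIT"},
    {"label": "GPE", "pattern": "San Diego"},
    {"label": "GPE", "pattern": "Boston"},
])

df = pd.DataFrame({
    "id": [1, 2],
    "text": [
        "University of California has great research located in San Diego",
        "MIT is at Boston",
    ],
})

# nlp.pipe yields one Doc per input text, in the same order as the input.
docs = list(nlp.pipe(df["text"].tolist()))

# Keep only the entity text for each label of interest.
orgs = [[ent.text for ent in doc.ents if ent.label_ == "ORG"] for doc in docs]
places = [[ent.text for ent in doc.ents if ent.label_ == "GPE"] for doc in docs]

# Wrapping in a Series keeps each list as a single cell value.
df["orgs"] = pd.Series(orgs, index=df.index)
df["places"] = pd.Series(places, index=df.index)
print(df[["id", "orgs", "places"]])
```

nlp.pipe expects strings, so missing values would need to be filled or filtered out before the call.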