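We have a trained model that identifies custom named entities. The problem is that when the whole document is given, the model does not work as expected, but when only a few sentences are given, the results are excellent.
I want to select two sentences before and after a tagged entity.
For example, if part of the document contains the word Colombo (tagged as GPE), I need to select the two sentences before the tag and the two sentences after it. I tried a couple of approaches, but the complexity is too high.
Is there a built-in way in spaCy to address this problem?
I am using Python and spaCy.
I tried parsing the document by identifying the index of the tag, but that approach is really slow.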
Answer 0 (score: 1)
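It may be worth checking whether you can improve the custom named entity recognizer itself: it is unusual for extra context to hurt performance, and if you fix that issue the model will probably work better overall.
However, regarding your concrete question about surrounding sentences: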
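A Token or a Span (an entity is a Span) has a .sent attribute that returns the Span of the sentence covering it. By looking at the token just before the first token of a given sentence, or just after its last token, you can get the previous or next sentence for any token in the document.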
import spacy

def get_previous_sentence(doc, token_index):
    # The token just before this sentence's first token belongs to the previous sentence.
    if doc[token_index].sent.start - 1 < 0:
        return None
    return doc[doc[token_index].sent.start - 1].sent

def get_next_sentence(doc, token_index):
    # sent.end is exclusive, so it already indexes the first token of the next sentence.
    next_start = doc[token_index].sent.end
    if next_start >= len(doc):
        return None
    return doc[next_start].sent

nlp = spacy.load('en_core_web_lg')
text = "Jane is a name. Here is a sentence. Here is another sentence. Jane was the mayor of Colombo in 2010. Here is another filler sentence. And here is yet another padding sentence without entities. Someone else is the mayor of Colombo right now."
doc = nlp(text)

for ent in doc.ents:
    print(ent, ent.label_, ent.sent)
    print("Prev:", get_previous_sentence(doc, ent.start))
    print("Next:", get_next_sentence(doc, ent.start))
    print("----")
Output:
Jane PERSON Jane is a name.
Prev: None
Next: Here is a sentence.
----
Jane PERSON Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
Colombo GPE Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
2010 DATE Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
Colombo GPE Someone else is the mayor of Colombo right now.
Prev: And here is yet another padding sentence without entities.
Next: None
----
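The question asks for two sentences on each side rather than one. The following is a minimal sketch of one way to generalize the idea, not part of the original answer: it collects the sentence boundaries once from doc.sents and then looks up each entity's sentence with a binary search, which avoids re-walking the document for every entity. The names sentence_window and entity_contexts and the parameter n are hypothetical, and the sketch assumes the pipeline sets sentence boundaries (for example via the parser).

```python
from bisect import bisect_right

def sentence_window(sent_bounds, token_index, n=2):
    """Return (start, end) token offsets spanning the sentence that contains
    token_index plus up to n sentences on each side.

    sent_bounds is a list of (start, end) token offsets, one per sentence,
    in document order, with exclusive end offsets (as in spaCy spans).
    """
    starts = [start for start, _ in sent_bounds]
    # Index of the last sentence whose start is <= token_index.
    i = bisect_right(starts, token_index) - 1
    first = max(i - n, 0)
    last = min(i + n, len(sent_bounds) - 1)
    return sent_bounds[first][0], sent_bounds[last][1]

def entity_contexts(doc, n=2):
    """Yield (entity, context) pairs, where context is the slice of doc
    covering the entity's sentence and up to n sentences before and after.

    Assumes doc behaves like a spaCy Doc: doc.sents and doc.ents yield
    objects with .start and .end, and doc[a:b] returns a slice.
    """
    # Computed once per document, not once per entity.
    sent_bounds = [(sent.start, sent.end) for sent in doc.sents]
    for ent in doc.ents:
        start, end = sentence_window(sent_bounds, ent.start, n)
        yield ent, doc[start:end]
```

With the doc from the example above, this could be used as `for ent, context in entity_contexts(doc, n=2): print(ent.text, ent.label_, context.text)`. Near the beginning or end of the document the window is simply shorter, since there are fewer than n sentences available on that side.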