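What does spaCy offer when you want to break a document into different spans? For example, say my corpus produces the doc object below. For the task I am working on, I want to index the different sections while keeping the original object.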
doc = nlp("""
Patient History:
This is paragraph 1.
Assessment:
This is paragraph 2.
Signature:
This is paragraph 3.
""")
and then parse it so that, for example:
doc.sections_
would yield
["Patient History", "Assessment", "Signature"]
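Answer 0 (score: 1)
spaCy has no support for "sections". They are not a general feature of documents, and how they are defined varies wildly depending on whether you are dealing with novels, academic papers, newspapers, and so on.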
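The simplest approach is to split the document into pieces before feeding it to spaCy. If it is formatted like your example, that should be easy to do using the indentation.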
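As a rough illustration of splitting before parsing, here is a minimal sketch in plain Python. It assumes a header is a line that consists of a title followed by a colon, as in the question's example; a different cue, such as indentation, would need a different pattern. `HEADER_RE` and `split_into_sections` are hypothetical names, not part of spaCy.

```python
import re

# Assumption: a header is a line made up of a title followed by a colon.
HEADER_RE = re.compile(r"^(?P<title>[^\n:]+):[ \t]*$", re.MULTILINE)


def split_into_sections(text):
    """Return a list of (title, body) pairs in document order."""
    headers = list(HEADER_RE.finditer(text))
    sections = []
    for i, match in enumerate(headers):
        # A body runs from the end of its header to the start of the next one.
        body_start = match.end()
        body_end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        sections.append((match.group("title").strip(), text[body_start:body_end].strip()))
    return sections


note = """
Patient History:
This is paragraph 1.
Assessment:
This is paragraph 2.
Signature:
This is paragraph 3.
"""

sections = split_into_sections(note)
for title, body in sections:
    # Each body could then be parsed on its own, e.g. nlp(body).
    print(title, "->", body)
```

Each body can then be passed to `nlp` separately, at the cost of no longer having a single Doc for the whole note.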
If you really do want a single Doc object, you should be able to manage it with a pipeline extension for spaCy. See the documentation here.
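One possible way to keep a single Doc is a custom extension attribute whose getter computes the sections on demand. The sketch below assumes spaCy v3 and, again, that headers are lines ending in a colon. The attribute name `sections` and the helper `get_sections` are hypothetical; `Doc.set_extension` and `Doc.char_span` are the spaCy calls it relies on.

```python
import re

import spacy
from spacy.tokens import Doc

# Assumption: a header is a line made up of a title followed by a colon.
HEADER_RE = re.compile(r"^(?P<title>[^\n:]+):[ \t]*$", re.MULTILINE)


def get_sections(doc):
    """Return (title, Span) pairs; each Span runs from one header to the next."""
    headers = list(HEADER_RE.finditer(doc.text))
    sections = []
    for i, match in enumerate(headers):
        end_char = headers[i + 1].start() if i + 1 < len(headers) else len(doc.text)
        # "expand" snaps the character offsets outward to token boundaries.
        span = doc.char_span(match.start(), end_char, alignment_mode="expand")
        sections.append((match.group("title").strip(), span))
    return sections


# "sections" is a hypothetical attribute name; force=True allows re-registration.
Doc.set_extension("sections", getter=get_sections, force=True)

nlp = spacy.blank("en")  # tokenizer only; a trained pipeline is used the same way
doc = nlp("""
Patient History:
This is paragraph 1.
Assessment:
This is paragraph 2.
Signature:
This is paragraph 3.
""")

titles = [title for title, _ in doc._.sections]
first_span = doc._.sections[0][1]
print(titles)
```

Because the attribute is a getter, the sections are recomputed on every access. The spans are views into the original Doc, so token indices and annotations stay shared with it.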
Answer 1 (score: 0)
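Obviously this would have to run as a separate step over the files and has not been optimized for the pipeline, but here is my slightly hacky solution.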
class ParsedNoteSections(object):
    """
    Parse notes into sections based on entity tags. Each section is returned
    as a Span (a slice) of the original doc object.
    """

    def __init__(self, doc):
        self.doc = doc

    def get_section_titles(self):
        """Return an (entity, start, end) tuple for each section header."""
        return [(e, e.start, e.end) for e in self.doc.ents if e.label_ == 'NOTESECTION']

    def original(self):
        """Retrieve the original doc object."""
        return self.doc

    def __repr__(self):
        return repr(self.doc)

    def parse_note_sections(self):
        """Use section entities as break-points to split the original doc.

        Input:
            None
        Output:
            List of dictionaries, one per section, with the keys
            'section_label' and 'section_doc'.
        """
        section_titles = self.get_section_titles()
        # stopgap for possible errors
        assert len(section_titles) > 0
        doc_section_spans = []
        for idx, (section_label, label_start, _) in enumerate(section_titles):
            # the first section starts at the beginning of the doc, so any
            # text before the first header is kept with it
            section_start = 0 if idx == 0 else label_start
            # a section ends where the next header starts; the last section
            # runs to the end of the doc
            if idx < len(section_titles) - 1:
                section_end = section_titles[idx + 1][1]
            else:
                section_end = len(self.doc)
            section_doc = self.doc[section_start:section_end]
            doc_section_spans.append({'section_label': section_label, 'section_doc': section_doc})
        assert len(doc_section_spans) == len(section_titles)
        return doc_section_spans