Question

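I have a document that contains tables, and I want to extract contextual information (for example sentences, or any other form) so that I can label it and build a named entity recognizer.

Does anyone know how to build contextual training data for training a named entity recognizer, or how to annotate table data for that purpose?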

Answer 1

You can try building a custom NER model with spaCy. The script linked below can be adapted to your needs.

spaCy NER format: [https://dataturks.com/help/dataturks-ner-json-to-spacy-train.php]
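
For reference, here is a minimal sketch of the tuple layout commonly used for spaCy v2-style NER training data: the text plus a list of (start, end, label) character offsets. The helper name `to_spacy_example` is hypothetical and not part of spaCy. It simply finds the first occurrence of each entity string, so it assumes each entity appears once in the text.

```python
# Hypothetical helper (not a spaCy function): locate each entity string in the
# text and record character offsets as (start, end, label) tuples.
def to_spacy_example(text, entities):
    spans = []
    for surface, label in entities:
        start = text.find(surface)  # first occurrence only (assumption)
        if start == -1:
            continue  # skip entity strings that do not occur in the text
        spans.append((start, start + len(surface), label))
    return (text, {"entities": spans})


example = to_spacy_example(
    "New York is lovely but Milan is amazing!",
    [("New York", "GPE"), ("Milan", "GPE")],
)
print(example)
```

Newer spaCy versions expect this data to be converted further before training, so treat the tuple above as an intermediate format rather than the final input.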

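If your PDFs have a very fixed layout, you can convert them to text and annotate that text with the tool below. It gives you an annotation experience similar to Prodigy, for free.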

Doccano annotation tool: [https://github.com/chakki-works/doccano]
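
If the table has already been parsed into rows, one possible way to get annotatable text out of it is to flatten each row into a "Column: value" line and derive the entity offsets from the cell positions. This is only a sketch under assumptions: the `rows` data, the `column_labels` mapping, and the `row_to_example` helper are all made up for illustration, and all cell values are assumed to be strings.

```python
# Hypothetical table: each row is a dict of column name -> cell value (strings).
rows = [
    {"Name": "Alice Smith", "City": "Milan"},
    {"Name": "Bob Jones", "City": "New York"},
]

# Hypothetical mapping from column name to entity label.
column_labels = {"Name": "PERSON", "City": "GPE"}


def row_to_example(row, column_labels, sep="; "):
    """Flatten one row to text and return (text, {"entities": [(start, end, label)]})."""
    parts, spans, pos = [], [], 0
    for column, value in row.items():
        prefix = f"{column}: "
        start = pos + len(prefix)      # cell value begins after the "Column: " prefix
        end = start + len(value)
        if column in column_labels:    # only labelled columns become entities
            spans.append((start, end, column_labels[column]))
        parts.append(prefix + value)
        pos = end + len(sep)           # next part starts after the separator
    return sep.join(parts), {"entities": spans}


examples = [row_to_example(row, column_labels) for row in rows]
for text, annotations in examples:
    print(text, annotations)
```

Keeping the column name in the text gives the model some context around each value. The resulting lines can also be loaded into an annotation tool for manual correction.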

Answer 2

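spacy-annotator is a good fit for your problem.

It lets you annotate custom entities in text using ipywidgets.
The annotator also produces its output in the format that the spaCy NLP library expects.

Annotation example: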

import pandas as pd
import re
from spacy_annotator.pandas_annotations import annotate as pd_annotate

# Data
df = pd.DataFrame.from_dict({'full_text' : ['New York is lovely but Milan is amazing!']})

# Annotations
pd_dd = pd_annotate(df,
            col_text = 'full_text',     # Column in pandas dataframe containing text to be labelled
            labels = ['GPE', 'PERSON'], # List of labels
            sample_size=1,              # Size of the sample to be labelled
            delimiter='~',              # Delimiter to separate entities in GUI
            model = None,               # spaCy model for noisy pre-labelling
            regex_flags=re.IGNORECASE   # One (or more) regex flags to be applied when searching for entities in text
            )

# Example output
pd_dd['annotations'][0]

How to extract contextual data from tables to train a custom named entity recognizer?