Question

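I really need some help creating training data for spaCy. I have tried many ways to create it. I started with a CSV of words and entities, converted them into lists of words and entities, then grouped the words into a list of sentences and the tags into a list of tags per sentence. I then converted these to JSON. I now have several versions of the JSON file that I want to convert to the new .spacy format. However, none of the training data seems to work after using --converter ner, because the converter does not find an NER format.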

I first tried converting the examples to a JSON file:

next_sentence = ""
word_index_in_sentence = 0
start_index = list()
end_index = list()
sent_tags = list()
TRAIN_DATA = []
with open("/content/drive/MyDrive/train_file.json", "w+", encoding="utf-8") as f:
    for word_index, word in enumerate(word_list):
        if word_index_in_sentence == 0:
            start_index.append(0)
        else:
            start_index.append((end_index[word_index_in_sentence-1])+1)

        sent_tags.append(tag_list[word_index])

        if word == "." or word == "?" or word == "!" or word_index == len(word_list)-1:
            next_sentence += word
            end_index.append(start_index[word_index_in_sentence]+1)
            entities = ""
            for i in range(word_index_in_sentence):
                if (i != 0):
                    entities += ","
                entities += "(" + str(start_index[i]) + "," + str(end_index[i]) + "," + "'" + sent_tags[i] + "'" + ")"

            f.write('("' + next_sentence + '",{"entities": [' + entities + ']}),')
            next_sentence = ""
            word_index_in_sentence = 0
            start_index = list()
            end_index = list()
            sent_tags = list()
        else:
            if word_list[word_index + 1] == "," or word_list[word_index + 1] == "." or word_list[word_index + 1] == "!" or word_list[word_index + 1] == "?":
                next_sentence += word
                end_index.append(start_index[word_index_in_sentence]+len(word)-1)
            else:
                next_sentence += word + " "
                end_index.append(start_index[word_index_in_sentence]+len(word))
            word_index_in_sentence += 1

Because this did not work as expected, I then tried writing a list of dicts of dicts. So instead of

f.write('("' + next_sentence + '",{"entities": [' + entities + ']}),')

I created a list TRAIN_DATA and appended the values as dicts like this:

TRAIN_DATA.append({next_sentence: {"entities":entities}})

and again saved TRAIN_DATA to a JSON file.

However, when I use python -m spacy convert --converter ner /path/to/file /path/to/save, it does convert the file to .spacy, but it reports:


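⚠ Can't automatically detect NER format. Conversion may not succeed. See https://spacy.io/api/cli#convert
⚠ No sentence boundaries found to use with option `-n 1`. Use `-s` to automatically segment sentences or `-n 0` to disable.
⚠ No sentence boundaries found. Use `-s` to automatically segment sentences.
⚠ No document delimiters found. Use `-n` to automatically group sentences into documents.
✔ Generated output file (1 documents): /content/drive/MyDrive/TRAIN_DATA/hope.spacy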

My training data looks like this after conversion to JSON:


[{"Schartau sagte dem Tagesspiegel vom Freitag, Fischer sei in einer Weise aufgetreten, die alles andere als überzeugend war.": {"entities": "(0,8,'B-PER'),(9,14,'O'),(15,18,'O'),(19,31,'B-ORG'),(32,35,'O'),(36,42,'O'),(43,44,'O'),(45,52,'B-PER'),(53,56,'O'),(57,59,'O'),(60,65,'O'),(66,71,'O'),(72,82,'O'),(83,84,'O'),(85,88,'O'),(89,94,'O'),(95,101,'O'),(102,105,'O'),(106,117,'O'),(118,120,'O')"}}, {"welt.de vom 29.10.2005 Firmengründer Wolf Peter Bree arbeitete Anfang der siebziger Jahre als Möbelvertreter, als er einen fliegenden Händler aus dem Libanon traf.": {"entities": "(0,22,'[2005-10-29]'),...

Or like this:


[("Schartau sagte dem Tagesspiegel vom Freitag, Fischer sei in einer Weise aufgetreten, die alles andere als überzeugend war.", {"entities": (0,8,'B-PER'),(9,14,'O'),(15,18,'O'),(19,31,'B-ORG'),(32,35,'O'),(36,42,'O'),(43,44,'O'),(45,52,'B-PER'),(53,56,'O'),(57,59,'O'),(60,65,'O'),(66,71,'O'),(72,82,'O'),(83,84,'O'),(85,88,'O'),(89,94,'O'),(95,101,'O'),(102,105,'O'),(106,117,'O'),(118,120,'O')}), ....

python -m spacy debug data /path/to/config

gives me the output:


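⚠ The debug-data command is now available via the 'debug data' subcommand (without the hyphen). You can run python -m spacy debug --help for an overview of the other available debugging commands.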

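============================ Data file validation ============================
✔ Corpus is loadable
✔ Pipeline can be initialized with data

=============================== Training stats ===============================
Language: de
Training pipeline: transformer, ner
1 training docs
1 evaluation docs
✔ No overlap between training and evaluation data
✘ Low number of examples to train a new pipeline (1)

============================== Vocab & Vectors ==============================
ℹ 1 total word(s) in the data (1 unique)
ℹ No word vectors present in the package

========================== Named Entity Recognition ==========================
ℹ 1 label(s)
0 missing value(s) (tokens with '-' label)
⚠ Low number of examples for label 'stamt",{"entities":[(0,51,"O"),(52,67,"B' (1)
⚠ No examples for texts WITHOUT new label 'stamt",{"entities":[(0,51,"O"),(52,67,"B'
✔ No entities consisting of or starting/ending with whitespace
✔ No entities consisting of or starting/ending with punctuation

================================== Summary ==================================
✔ 5 checks passed
⚠ 2 warnings
✘ 1 error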

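Can someone help me convert my lists of words and entities into spaCy's NER format so that I can train an NER model? I would really appreciate it. Thank you!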

Answer 1

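This question was already answered in the Discussions, but in short: your data is not in an NER format, and it is not the JSON format the converter uses either. It is a format used for training data that happens to be saved as JSON.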

In this case, the easiest approach is probably to convert your data to column-style IOB data and run the converter on that.
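As a rough illustration of that suggestion, here is a minimal sketch that turns flat word and tag lists into column-style IOB text: one token per line, token first and tag last, with a blank line between sentences. The helper name to_iob_columns and the sample data are hypothetical; the sketch assumes sentences end at ".", "?" or "!" (the same rule the question uses), that the tags are already valid IOB tags, and that no token contains whitespace.

```python
# Sentence-final tokens, mirroring the splitting rule used in the question.
SENTENCE_END = {".", "?", "!"}


def to_iob_columns(words, tags):
    """Return column-style IOB text: one "token tag" pair per line,
    with a blank line after each sentence."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    lines = []
    for word, tag in zip(words, tags):
        lines.append(f"{word} {tag}")
        if word in SENTENCE_END:
            lines.append("")  # blank line marks a sentence boundary
    # Drop trailing blank lines so the text ends with exactly one newline.
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines) + "\n"


# Hypothetical sample standing in for word_list / tag_list from the question.
word_list = ["Schartau", "sagte", "dem", "Tagesspiegel", ".", "Fischer", "kam", "."]
tag_list = ["B-PER", "O", "O", "B-ORG", "O", "B-PER", "O", "O"]

iob_text = to_iob_columns(word_list, tag_list)
print(iob_text)
```

Writing iob_text to a file (for example train.iob, an assumed name) and then running python -m spacy convert --converter ner train.iob /path/to/save should let the converter pick up the sentence boundaries from the blank lines. If it still warns about missing document delimiters, the -n option mentioned in the warning can group sentences into documents.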

NER: Defining training data for spaCy v3
