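I want to try solving the NER task with the CoNLL 2003 data. I have seen a lot of information on how to prepare the dataset for training, but every source did it differently and none of them was comprehensive.
First, I converted the data into sentences: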
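def read_file(path):
    """Read a CoNLL 2003 file into sentences of (token, pos, chunk, ner) tuples."""
    sentences = []
    sentence = []
    with open(path, "r", encoding="utf-8") as f:
        # My file contains literal "\n" sequences (it was dumped from bytes),
        # so normalise them to real newlines before splitting.
        lines = f.read().replace("\\n", "\n").split("\n")
    for line in lines:
        line = line.strip()
        # Skip document separators, with or without the leading "b'" of the dump.
        if line.startswith(("-DOCSTART-", "b'-DOCSTART-")):
            continue
        elif len(line) == 0:
            # A blank line ends the current sentence.
            if len(sentence) > 0:
                sentences.append(sentence)
                sentence = []
            continue
        try:
            parts = line.split(" ")
            # The last three fields are POS, chunk and NER tag; the rest is the token.
            sentence.append((" ".join(parts[:-3]), parts[-3], parts[-2], parts[-1]))
        except Exception as e:
            print(e, "line: ", line)
    # Keep the last sentence if the file does not end with a blank line.
    if len(sentence) > 0:
        sentences.append(sentence)
    return sentences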
Part of the output looks like this:
[('EU', 'NNP', 'I-NP', 'I-ORG'),
('rejects', 'VBZ', 'I-VP', 'O'),
('German', 'JJ', 'I-NP', 'I-MISC'),
('call', 'NN', 'I-NP', 'O'),
('to', 'TO', 'I-VP', 'O'),
('boycott', 'VB', 'I-VP', 'O'),
('British', 'JJ', 'I-NP', 'I-MISC'),
('lamb', 'NN', 'I-NP', 'O'),
('.', '.', 'O', 'O')]
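One detail visible in this output: the first token of an entity is tagged `I-ORG`, not `B-ORG`, which suggests the tags follow the IOB1 scheme. Many NER tools and tutorials assume IOB2 (every entity starts with `B-`), so a conversion may be needed. The following is only a minimal sketch of such a conversion; the function name `iob1_to_iob2` is made up for illustration and is not part of any library.

```python
def iob1_to_iob2(tags):
    """Convert one sentence's NER tags from IOB1 to IOB2."""
    converted = []
    prev = "O"
    for tag in tags:
        # In IOB1 an entity may start with I-; IOB2 always starts it with B-.
        if tag.startswith("I-") and (prev == "O" or prev[2:] != tag[2:]):
            converted.append("B-" + tag[2:])
        else:
            converted.append(tag)
        prev = tag
    return converted


# The NER column of the sentence shown above.
ner_tags = ["I-ORG", "O", "I-MISC", "O", "O", "O", "I-MISC", "O", "O"]
print(iob1_to_iob2(ner_tags))
# ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
```

Whether this step is needed depends on what the downstream model or evaluation script expects.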
What should I do as the next step of the NER pipeline to prepare this data for training?
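For reference, one common next step (not the only one) is to map tokens and tags to integer ids and pad every sentence to a fixed length, so the data can be fed to a sequence model. The sketch below assumes that approach; `build_vocab`, `encode`, the `<PAD>`/`<UNK>` symbols and `max_len` are illustrative names and choices, not taken from any particular library.

```python
PAD, UNK = "<PAD>", "<UNK>"


def build_vocab(items, specials=()):
    """Map each distinct item to an integer id, special symbols first."""
    vocab = {s: i for i, s in enumerate(specials)}
    for item in items:
        if item not in vocab:
            vocab[item] = len(vocab)
    return vocab


def encode(sentences, word2id, tag2id, max_len):
    """Turn sentences into equal-length lists of word ids and tag ids."""
    X, y = [], []
    for sent in sentences:
        # Truncate long sentences, then pad short ones up to max_len.
        words = [w for w, _, _, _ in sent][:max_len]
        tags = [t for _, _, _, t in sent][:max_len]
        pad = max_len - len(words)
        X.append([word2id.get(w, word2id[UNK]) for w in words] + [word2id[PAD]] * pad)
        y.append([tag2id[t] for t in tags] + [tag2id[PAD]] * pad)
    return X, y


# Two tiny sentences in the same (token, pos, chunk, ner) format as above.
sentences = [
    [("EU", "NNP", "I-NP", "I-ORG"), ("rejects", "VBZ", "I-VP", "O"),
     ("German", "JJ", "I-NP", "I-MISC")],
    [("Peter", "NNP", "I-NP", "I-PER"), ("Blackburn", "NNP", "I-NP", "I-PER")],
]
word2id = build_vocab((w for s in sentences for w, _, _, _ in s), specials=(PAD, UNK))
tag2id = build_vocab((t for s in sentences for _, _, _, t in s), specials=(PAD,))
X, y = encode(sentences, word2id, tag2id, max_len=4)
print(X)  # [[2, 3, 4, 0], [5, 6, 0, 0]]
print(y)  # [[1, 2, 3, 0], [4, 4, 0, 0]]
```

In practice the vocabularies would be built from the training split only, and the padded positions would be masked out in the loss.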