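I want to annotate my data with the labels [DATE, TIME]. The data is a list of sentences in the following format: each dict contains the text of the sentence, and `spans` holds the start and end offsets along with the desired label.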
[{'text': 'On 1600 January 13/2317 at 11h 50m, the right ascension of Mars was: 18 19',
'spans': [{'start': 7, 'end': 18, 'label': 'TIME'}]},
{'text': 'Hence, Mars is at 10° 38’ 46” Leo, at an adjusted time of 11h 40m reduced to the meridian of Uraniborg.',
'spans': []},
{'text': 'But on January 24/February 3 at the same time it was at 6° 18’ Leo.',
'spans': [{'start': 6, 'end': 17, 'label': 'DATE'}]}]
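To make the format concrete, here is a small sketch that slices each span out of its text. The record and the helper name `span_texts` are made up for illustration, and the sketch assumes `end` is exclusive, as in Python slicing and in the value returned by `match.span()`.

```python
def span_texts(record):
    """Return (label, substring) pairs for every span in one record."""
    text = record["text"]
    return [(s["label"], text[s["start"]:s["end"]]) for s in record["spans"]]


# Hypothetical record, not taken from the real data set.
record = {
    "text": "Meet at 11h 50m.",
    "spans": [{"start": 8, "end": 15, "label": "TIME"}],
}

print(span_texts(record))
```

Running a check like this over the generated file is a quick way to see whether each label ended up on the substring it was meant for.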
I tried this and was able to annotate the data and convert it to this format, but only for a single label:
import re
from prodigy.util import write_jsonl

label = "DATE"  # whatever label you want to use
texts = texts  # a list of your texts
regex_patterns = [
    # your expressions, whatever you need
    re.compile(r'\d{4} [A-Z][a-z.]+ \d{2} |\d{4} [A-Z][a-z.]+ \d{2} | [JFMASOND][a-z.]+\s\d{1,2}' )
]

examples = []
for text in texts:
    for expression in regex_patterns:
        spans = []
        for match in re.finditer(expression, text):
            start, end = match.span()
            span = {"start": start, "end": end, "label": label}
            spans.append(span)
        task = {"text": text, "spans": spans}
        examples.append(task)

write_jsonl("data_DATE.jsonl", examples)
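This code only works for one label. How can I annotate the data with multiple labels?
I tried the following, but it obviously does not work: it annotates the data based on the last label only.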
import re
from prodigy.util import write_jsonl

label = ["DATE", "TIME"]  # whatever label you want to use
texts = texts  # a list of your texts

for lbl in label:
    if lbl == "DATE":
        regex_patterns = [
            # your expressions, whatever you need
            re.compile(r'\d{4} [A-Z][a-z.]+ \d{2} |\d{4} [A-Z][a-z.]+ \d{2} | [JFMASOND][a-z.]+\s\d{1,2}' )
        ]
    if lbl == "TIME":
        regex_patterns = [
            re.compile(r'\d{1,2}h\s\d{1,2}m | \d{1,2}h' )
        ]

examples = []
for text in texts:
    for expression in regex_patterns:
        spans = []
        for match in re.finditer(expression, text):
            start, end = match.span()
            span = {"start": start, "end": end, "label": lbl}
            spans.append(span)
        task = {"text": text, "spans": spans}
        examples.append(task)

write_jsonl("data.jsonl", examples)
I know the answer should not be that difficult, but I cannot find it. Thank you very much for your help.
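One way this could work, sketched under some assumptions: keep a mapping from each label to its own list of patterns, and collect the spans for all labels into the same list before building the task for a text, so that no label overwrites another. The names `PATTERNS`, `annotate`, and `dump_jsonl` are hypothetical, the regular expressions are simplified stand-ins for the ones in the question (spelled-out month names, no stray spaces around `|`), and the standard `json` module does the writing so that the sketch runs without Prodigy.

```python
import json
import re

# Assumption: simplified stand-ins for the expressions in the question.
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")

# One list of compiled patterns per label.
PATTERNS = {
    "DATE": [re.compile(rf"(?:\d{{4}} )?{MONTH} \d{{1,2}}\b")],
    "TIME": [re.compile(r"\d{1,2}h(?:\s\d{1,2}m)?")],
}


def annotate(texts, patterns=PATTERNS):
    """Build one task per text, holding the spans of every label."""
    examples = []
    for text in texts:
        spans = []  # reset once per text, not once per label or pattern
        for label, expressions in patterns.items():
            for expression in expressions:
                for match in expression.finditer(text):
                    start, end = match.span()
                    spans.append({"start": start, "end": end, "label": label})
        # Keep spans in reading order, whichever label produced them.
        spans.sort(key=lambda s: (s["start"], s["end"]))
        examples.append({"text": text, "spans": spans})
    return examples


def dump_jsonl(path, examples):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


sample_texts = [
    "On 1600 January 13/2317 at 11h 50m, the right ascension of Mars was: 18 19",
    "But on January 24/February 3 at the same time it was at 6° 18’ Leo.",
]
examples = annotate(sample_texts)
for example in examples:
    for span in example["spans"]:
        print(span["label"], repr(example["text"][span["start"]:span["end"]]))
```

Calling `dump_jsonl("data.jsonl", examples)` afterwards, or `write_jsonl` from `prodigy.util` as imported in the question, writes the result. This sketch does not resolve overlapping matches between labels; if two patterns can match the same characters, the overlaps would need to be filtered before the file is used.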