I want to annotate data into a specific format that can be used in Prodigy (spaCy)

Asked: 2019-05-10 17:17:07

Tags: regex annotations spacy

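I want to annotate my data with the labels [DATE, TIME]. The result should be a list of sentences in the following specific format (each dict contains the text of the sentence, and the spans include the start and end offsets along with the desired label):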

[{'text': 'On 1600 January 13/2317 at 11h 50m, the right ascension of Mars was: 18 19',
  'spans': [{'start': 7, 'end': 18, 'label': 'TIME'}]},
 {'text': 'Hence, Mars is at 10° 38’ 46” Leo, at an adjusted time of 11h 40m reduced to the meridian of Uraniborg.',
  'spans': []},
 {'text': 'But on January 24/February 3 at the same time it was at 6° 18’ Leo.',
  'spans': [{'start': 6, 'end': 17, 'label': 'DATE'}]}]
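To see what a span in this format actually covers, the offsets can be used as a Python slice. A quick sketch using two of the sentences above (the helper name `span_texts` is made up for illustration):

```python
# Two of the example tasks from above, copied as-is.
examples = [
    {"text": "On 1600 January 13/2317 at 11h 50m, the right ascension of Mars was: 18 19",
     "spans": [{"start": 7, "end": 18, "label": "TIME"}]},
    {"text": "But on January 24/February 3 at the same time it was at 6° 18’ Leo.",
     "spans": [{"start": 6, "end": 17, "label": "DATE"}]},
]


def span_texts(example):
    """Return (label, covered substring) for every span of one task."""
    text = example["text"]
    return [(span["label"], text[span["start"]:span["end"]])
            for span in example["spans"]]


for example in examples:
    print(span_texts(example))
```

Both slices come out with a leading space (`' January 13'` and `' January 24'`), so the end offset behaves like a Python slice end, and the first span is labelled TIME even though it covers a date.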

I tried to do the annotation, and I can annotate and convert the data to this format, but only for a single label:

import re
from prodigy.util import write_jsonl

label = "DATE"   # whatever label you want to use
texts = texts  # a list of your texts
regex_patterns = [
                # your expressions – whatever you need
                re.compile(r'\d{4} [A-Z][a-z.]+ \d{2} |\d{4} [A-Z][a-z.]+ \d{2} | [JFMASOND][a-z.]+\s\d{1,2}' ) 
            ]
examples = []
for text in texts:
    spans = []  # collect the matches of every pattern for this text
    for expression in regex_patterns:
        for match in expression.finditer(text):
            start, end = match.span()
            span = {"start": start, "end": end, "label": label}
            spans.append(span)
    task = {"text": text, "spans": spans}
    examples.append(task)

write_jsonl("data_DATE.jsonl", examples)

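This code only works for one label. How can I annotate the data with multiple labels?

I tried the following, but it clearly does not work: it only annotates the data based on the last label.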


import re
from prodigy.util import write_jsonl

label = ["DATE", "TIME"]   # whatever label you want to use
texts = texts  # a list of your texts
for lbl in label:
        if lbl == "DATE":
            regex_patterns = [
                # your expressions – whatever you need
                re.compile(r'\d{4} [A-Z][a-z.]+ \d{2} |\d{4} [A-Z][a-z.]+ \d{2} | [JFMASOND][a-z.]+\s\d{1,2}' )                  
        ]
        if lbl == "TIME":
             regex_patterns = [
                re.compile(r'\d{1,2}h\s\d{1,2}m | \d{1,2}h' )  
        ]



        examples = []
        for text in texts:
            for expression in regex_patterns:
                spans = []
                for match in re.finditer(expression, text):
                    start, end = match.span()
                    span = {"start": start, "end": end, "label": lbl}
                    spans.append(span)
                task = {"text": text, "spans": spans}
                examples.append(task)

write_jsonl("data.jsonl", examples)

The answer is probably not that difficult, I know, but I can't find it. Thank you very much for your help.
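For reference, the attempt above resets `examples = []` inside the `for lbl in label` loop, so only the tasks from the last label survive, and each label would produce its own task per text instead of sharing one. One possible way around this, as a sketch only: keep a mapping from label to patterns and collect all spans for a text before building its task. The names `PATTERNS`, `annotate` and `to_jsonl` are hypothetical, and the regexes are simplified stand-ins with explicit month names, not the exact expressions from the question.

```python
import json
import re

# Assumption: explicit month names instead of the broader [JFMASOND][a-z.]+
# from the question, which could also match ordinary capitalised words.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Hypothetical mapping: label -> list of compiled patterns for that label.
PATTERNS = {
    "DATE": [re.compile(r"(?:\d{4} )?(?:" + MONTHS + r") \d{1,2}")],
    "TIME": [re.compile(r"\d{1,2}h(?:\s\d{1,2}m)?")],
}


def annotate(texts, patterns):
    """Build one task per text, holding the spans of all labels."""
    examples = []
    for text in texts:
        spans = []
        for label, expressions in patterns.items():
            for expression in expressions:
                for match in expression.finditer(text):
                    start, end = match.span()
                    spans.append({"start": start, "end": end, "label": label})
        # Keep spans in reading order, independent of label order.
        spans.sort(key=lambda span: (span["start"], span["end"]))
        examples.append({"text": text, "spans": spans})
    return examples


def to_jsonl(examples):
    """Serialise the tasks as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)


TEXTS = [
    "On 1600 January 13/2317 at 11h 50m, the right ascension of Mars was: 18 19",
    "Hence, Mars is at 10° 38’ 46” Leo, at an adjusted time of 11h 40m reduced to the meridian of Uraniborg.",
    "But on January 24/February 3 at the same time it was at 6° 18’ Leo.",
]

examples = annotate(TEXTS, PATTERNS)
print(to_jsonl(examples))
```

The resulting `examples` list could be passed to `write_jsonl("data.jsonl", examples)` from the code above instead of printing it. This sketch does not handle the case where patterns for different labels match overlapping text; in that case one would have to decide which span to keep.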

0 answers:

No answers