Character offset in Brat's annotation file through python

Time: 2016-07-11 21:58:23

Tags: python offset brat

I have two UTF-8 text files:

repr(file1.txt):

\nSTATEMENT OF WORK\n\n\nSTATEMENT OF WORK NO. 7\nEffective Date: February 15, 2015

repr(file2.txt):

RENEWAL/AMENDMENT\n\nTHIS agreement is entered as of July 25, 2014. b

Their respective Brat annotation files have the following annotation:

file1.ann:

T1  date 61 78  February 15, 2015

file2.ann:

T1  date 53 67   July 25, 2014.

But when I use Python to retrieve the characters from the .txt files using the above offsets, I get:

file1.read()[61:78]:

February 15, 2015

file2.read()[53:67]:

ly 25, 2014. b

Why does my offsetting work in the first case but not the second case?

1 answer:

Answer 0 (score: 0)

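The problem comes from the fact that the carriage return ('\r' in text files) and the newline ('\n') are not handled the same way on Windows and on Unix/Mac. If you used a Windows system to produce or modify the .txt files, they will contain some '\r\n' sequences, but brat (which was not designed with Windows in mind) will only count the '\n' characters.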
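To make the effect of line endings on offsets concrete, here is a minimal sketch. It assumes that file2.txt was saved with '\r\n' line endings, which the question does not confirm; the sample bytes are modeled on the repr shown in the question and written to a temporary file. It compares Python's default text mode, which translates '\r\n' to '\n', with newline='', which leaves the characters untouched.

```python
import os
import tempfile

# Sample content modeled on file2.txt from the question. The '\r\n' line
# endings are an assumption about how that file was saved.
raw_bytes = b"RENEWAL/AMENDMENT\r\n\r\nTHIS agreement is entered as of July 25, 2014. b"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw_bytes)
    path = tmp.name

try:
    # Default text mode: universal newlines turn each '\r\n' into '\n'.
    with open(path, encoding="utf-8") as f:
        translated = f.read()
    # newline='' disables that translation, so every '\r' survives.
    with open(path, newline="", encoding="utf-8") as f:
        untranslated = f.read()
finally:
    os.remove(path)

print(len(translated), len(untranslated))  # 67 69
print(repr(translated[53:67]))             # 'ly 25, 2014. b'
print(repr(untranslated[53:67]))           # 'July 25, 2014.'
```

Under that assumption, the two-character shift seen in the question equals the number of '\r' characters that precede the annotated span.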

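With Python, open the file with the argument newline='' to make sure the '\r' characters are present in the resulting W_Content variable; you can then convert from the Windows count to the Unix count using a dict:

with open('file.txt', newline='', encoding='utf-8') as f:
    W_Content = f.read()  # newline='' keeps every '\r' in the string

counter = -1
UfromW_dic = {}  # index in W_Content -> index once '\r' is ignored
for n, char in enumerate(W_Content):
    if char != '\r':
        counter += 1
    UfromW_dic[n] = counter

After this, a span initially at [x, y] will be found at [UfromW_dic[x], UfromW_dic[y]].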
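As a hedged usage sketch, the same loop can be wrapped in a function and applied to the span from the question. The name `build_offset_map` is hypothetical, the sample string assumes file2.txt contains '\r\n' line endings, and the extra entry for the end of the text is an addition that is not part of the loop above.

```python
def build_offset_map(windows_text):
    """Map each index of a string containing '\\r' to its index without '\\r'."""
    mapping = {}
    counter = -1
    for n, char in enumerate(windows_text):
        if char != "\r":
            counter += 1
        mapping[n] = counter
    # Addition: let an end offset equal to len(windows_text) resolve as well.
    mapping[len(windows_text)] = counter + 1
    return mapping


# Sample text modeled on file2.txt; the '\r\n' endings are an assumption.
windows_text = "RENEWAL/AMENDMENT\r\n\r\nTHIS agreement is entered as of July 25, 2014. b"
unix_text = windows_text.replace("\r", "")  # the same text with every '\r' removed

offset_map = build_offset_map(windows_text)
start, end = offset_map[53], offset_map[67]

print(start, end)                  # 51 65
print(repr(unix_text[start:end]))  # 'July 25, 2014.'
```

With this input, the annotated span [53, 67] maps to [51, 65], which is where the date sits once the '\r' characters are gone.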