Question

I have two UTF-8 text files:

repr(file1.txt):

\nSTATEMENT OF WORK\n\n\nSTATEMENT OF WORK NO. 7\nEffective Date: February 15, 2015

repr(file2.txt):

RENEWAL/AMENDMENT\n\nTHIS agreement is entered as of July 25, 2014. b

Their respective Brat annotation files have the following annotation:

file1.ann:

T1  date 61 78  February 15, 2015

file2.ann:

T1  date 53 67   July 25, 2014.

But when I use Python to retrieve the characters from the .txt files using the above offsets, I get:

file1.read()[61:78]:

February 15, 2015

file2.read()[53:67]:

ly 25, 2014. b

Why does my offsetting work in the first case but not the second case?

Answer 1

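The problem comes from the fact that carriage return (\r in a text file) and newline (\n) are not handled the same way on Windows and on Unix/Mac. If you generate or modify the .txt file on a Windows system, it will contain some '\r\n' sequences, but brat (which was not designed with Windows in mind) only counts the '\n' character.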
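To see how far two carriage returns move an offset, here is a small sketch built on the text of the second file. It assumes the file on disk uses '\r\n' line endings; the question only shows the text after Python had already read it, so that is an assumption, not something the question confirms. Python's default text mode translates '\r\n' to '\n' while reading, and the replace() call below stands in for that translation.

```python
# Text of file2 as it might look on disk with Windows line endings (assumed).
raw = "RENEWAL/AMENDMENT\r\n\r\nTHIS agreement is entered as of July 25, 2014. b"

# Stand-in for what open(...).read() returns in default text mode,
# where each '\r\n' is translated to '\n'.
translated = raw.replace("\r\n", "\n")

start, end = 53, 67  # offsets taken from file2.ann

with_cr = raw[start:end]            # slice with every '\r' still counted
without_cr = translated[start:end]  # slice after the newline translation

# Number of '\r' characters sitting before the annotated span.
cr_before_span = raw.count("\r", 0, start)

print(repr(with_cr))
print(repr(without_cr))
print(cr_before_span)
```

Under that assumption, the two slices differ by exactly the number of '\r' characters before the span, which is the two-character drift seen in the question.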

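Using Python, open the file with the argument newline='' so that every '\r' is kept in the resulting W_Content variable. You can then convert from the Windows count to the Unix count with a dict:

    with open('file.txt', newline='', encoding='utf-8') as f:
        W_Content = f.read()

    # Map each index in the '\r\n' text to its index in the '\n'-only text.
    counter = -1
    UfromW_dic = {}
    for n, char in enumerate(W_Content):
        if char != '\r':
            counter += 1
        UfromW_dic[n] = counter

After this, the initial span [x, y] will be found at [UfromW_dic[x], UfromW_dic[y]].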
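As a rough sketch of how the mapping could be applied, here is a variant wrapped in a helper. The name build_offset_map and the sample string are hypothetical, and the sample assumes '\r\n' line endings in the second file. Unlike the loop above, this variant records how many non-'\r' characters come before each index, so an exclusive end offset that lands on a '\r' (or on the very end of the text) still maps to the right place.

```python
def build_offset_map(text_with_cr):
    # For every index in the text that still contains '\r', store the
    # matching index in the same text with all '\r' characters removed.
    mapping = {}
    kept = 0  # non-'\r' characters seen before the current index
    for n, char in enumerate(text_with_cr):
        mapping[n] = kept
        if char != "\r":
            kept += 1
    # An exclusive end offset may equal len(text), so map that too.
    mapping[len(text_with_cr)] = kept
    return mapping


# Assumed on-disk content of file2 (Windows line endings).
raw = "RENEWAL/AMENDMENT\r\n\r\nTHIS agreement is entered as of July 25, 2014. b"
no_cr = raw.replace("\r", "")

offsets = build_offset_map(raw)
start, end = 53, 67  # span from file2.ann
new_start, new_end = offsets[start], offsets[end]
snippet = no_cr[new_start:new_end]

print(new_start, new_end, repr(snippet))
```

If the annotation offsets already line up with the text read using newline='', no mapping is needed at all and the raw slice can be used directly.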

Character offset in Brat's annotation file through python
