I have two UTF-8 text files:
repr(file1.txt):
\nSTATEMENT OF WORK\n\n\nSTATEMENT OF WORK NO. 7\nEffective Date: February 15, 2015
repr(file2.txt):
RENEWAL/AMENDMENT\n\nTHIS agreement is entered as of July 25, 2014. b
Their respective Brat annotation files contain the following annotations:
file1.ann:
T1 date 61 78 February 15, 2015
file2.ann:
T1 date 53 67 July 25, 2014.
But when I use Python to retrieve the characters from the .txt files using the above offsets, I get:
file1.read()[61:78]:
February 15, 2015
file2.read()[53:67]:
ly 25, 2014. b
Why does my offsetting work in the first case but not the second case?
Answer 0 (score: 0)
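The problem comes from the fact that the carriage return (\r in text files) and the newline (\n) are not handled the same way on Windows and on Unix/Mac. If you used a Windows system to produce or modify the .txt file, it will contain some '\r\n' sequences, but brat (which was not designed with Windows in mind) counts only the '\n' characters.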
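As a minimal sketch of why the two counts can differ, the snippet below writes a made-up two-line sample with a Windows line ending to a temporary file and reads it back twice. It relies only on the documented behaviour of Python's `open()`: in default text mode '\r\n' is translated to '\n', while `newline=''` leaves line endings untouched. The sample text and variable names are illustrative, not taken from the question's files.

```python
import os
import tempfile

# Hypothetical sample: two lines separated by a Windows line ending.
raw = "ab\r\ncd".encode("utf-8")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

# Default text mode: universal newlines translate '\r\n' to '\n'.
with open(path, encoding="utf-8") as f:
    translated = f.read()

# newline='' disables the translation, so '\r' is kept.
with open(path, newline="", encoding="utf-8") as f:
    untranslated = f.read()

os.remove(path)

print(repr(translated))    # 'ab\ncd'
print(repr(untranslated))  # 'ab\r\ncd'
```

Every character after the first line break sits one position later in the untranslated string, so offsets computed against one form do not apply to the other.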
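With Python, open the file with the argument newline='' to make sure that '\r' is still present in the resulting W_Content variable. You can then convert from the Windows count to the brat count with a dict:

    with open('file.txt', newline='', encoding='utf-8') as f:
        W_Content = f.read()

    # Map each index in W_Content (counting '\r') to the index
    # the same character has when '\r' is not counted.
    counter = -1
    UfromW_dic = {}
    for n, char in enumerate(W_Content):
        if char != '\r':
            counter += 1
        UfromW_dic[n] = counter

After this, a span that was initially [x, y] will be found at [UfromW_dic[x], UfromW_dic[y]].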
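The following sketch applies the same mapping to an in-memory copy of file2.txt instead of reading from disk. It assumes the file was saved with Windows line endings, which the question does not state. The helper name `build_offset_map` is hypothetical and simply wraps the loop above.

```python
def build_offset_map(text):
    # Hypothetical helper: maps each index in `text` (which still contains
    # carriage returns) to the index of the same character once every
    # carriage return has been removed.
    mapping = {}
    counter = -1
    for n, char in enumerate(text):
        if char != "\r":
            counter += 1
        mapping[n] = counter
    return mapping


# Assumption: file2.txt was saved with Windows line endings.
w_text = "RENEWAL/AMENDMENT\r\n\r\nTHIS agreement is entered as of July 25, 2014. b"
u_text = w_text.replace("\r", "")  # what a default open().read() would return

offset_map = build_offset_map(w_text)
start, end = offset_map[53], offset_map[67]

print(repr(w_text[53:67]))                  # 'July 25, 2014.'
print(repr(u_text[53:67]))                  # 'ly 25, 2014. b'
print(start, end, repr(u_text[start:end]))  # 51 65 'July 25, 2014.'
```

Under that assumption, the span 53 to 67 selects the date in the text that keeps '\r', and the mapped span 51 to 65 selects the same date in the text without it. Note that the mapping has no entry for an end offset equal to the length of the text, so a span that runs to the very end of the file would need that case handled separately.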