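I am working with an annotated corpus that consists of two sets of .txt files. The first set contains the documents that were annotated (articles, blog posts, etc.), and the second set contains the actual annotations. Annotations are matched to the annotated text by "byte spans". From the README: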
"The span is the starting and ending byte of the annotation in
the document. For example, the annotation listed above is from
the document, temp_fbis/20.20.10-3414. The span of this annotation
is 730,740. This means that the start of this annotation is
byte 730 in the file docs/temp_fbis/20.20.10-3414, and byte 740
is the character after the last character of the annotation."
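So, the question: how do I index the start and end bytes in a document so that I can match each annotation to the text in the original document? Any ideas? I'm working in Python...
Answer 0 (score: 0)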
"This means that the start of this annotation is
byte 730 in the file docs/temp_fbis/20.20.10-3414, and byte 740
is the character after the last character of the annotation.
blah, blah, blah, example annotation, blah, blah, blah
                  |                 |
             start byte         end byte
The data_type of all annotations should be 'string'."
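To make the half-open span concrete, here is a minimal sketch in Python. The sample bytes and the span 18,36 are made up for illustration and are not taken from the corpus. It only shows that "start byte, and the byte after the last one" lines up with ordinary Python slicing on a bytes object.

```python
# Hypothetical document contents (not from the corpus). Real files would be
# read in binary mode so that offsets are counted in bytes.
doc = b"blah, blah, blah, example annotation, blah, blah, blah"

# Hypothetical span in the README's "start,end" form.
start, end = 18, 36

# The end offset points at the byte *after* the annotation, which is exactly
# how Python slices work: start is included, end is excluded.
annotation = doc[start:end]

# The length of the annotation in bytes is simply end - start.
length = end - start

print(annotation, length)
```

If the documents contain multi-byte characters, byte offsets and character offsets will differ, so it is probably safer to slice the raw bytes first and decode the result afterwards.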
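Answer 1 (score: 0)
# open, seek, read
start, end = 730, 740
# Binary mode, so that offsets are counted in bytes rather than characters.
with open("myfile", "rb") as f:
    f.seek(start)
    # The end byte is exclusive, so the annotation is end - start bytes long.
    data = f.read(end - start)
# Do stuff with data (a bytes object).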
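The seek-and-read idea can be wrapped in a small helper. The sketch below is only one way to do it: the names read_span and parse_span are hypothetical, the "start,end" string format is taken from the README excerpt, and the UTF-8 default is an assumption, since the thread does not say how the corpus is encoded. The demo writes a throwaway temp file instead of touching real corpus data.

```python
import os
import tempfile


def parse_span(span):
    """Turn a README-style span such as "730,740" into (start, end) integers."""
    start, end = (int(part) for part in span.split(","))
    return start, end


def read_span(path, start, end, encoding="utf-8"):
    """Return the text between byte offsets start (inclusive) and end (exclusive).

    The encoding is an assumption; replace it with the corpus's real encoding.
    """
    with open(path, "rb") as f:      # binary mode: offsets are byte offsets
        f.seek(start)                # jump to the first byte of the annotation
        raw = f.read(end - start)    # read exactly the annotated bytes
    # Decode only after slicing, so multi-byte characters cannot shift offsets.
    return raw.decode(encoding, errors="replace")


# Demo on a temporary file with made-up contents.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"blah, blah, blah, example annotation, blah, blah, blah")
    demo_path = tmp.name

start, end = parse_span("18,36")
text = read_span(demo_path, start, end)
os.remove(demo_path)
print(text)
```

For the real corpus, the call would look something like read_span("docs/temp_fbis/20.20.10-3414", 730, 740), with the path adjusted to wherever the docs directory lives.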