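This is a continuation of a question I posted here earlier; I am struggling to parse a RIS file. I have now combined some code into a new parser that reads a record correctly. Unfortunately, the code stops after the first record, and I don't know how to distinguish the end of the file from the double newline that separates records. Any ideas?
The input file is provided here: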
Record #1 of 306
ID: CN-01160769
AU: Uedo N
AU: Yao K
AU: Muto M
AU: Ishikawa H
TI: Development of an E-learning system.
SO: United European Gastroenterology Journal
YR: 2015
VL: 3
NO: 5 SUPPL. 1
PG: A490
XR: EMBASE 72267184
PT: Journal: Conference Abstract
DOI: 10.1177/2050640615601623
US: http://onlinelibrary.wiley.com/o/cochrane/clcentral/articles/769/CN-01160769/frame.html
Record #2 of 306
ID: CN-01070265
AU: Krogh LQ
AU: Bjornshave K
AU: Vestergaard LD
AU: Sharma MB
AU: Rasmussen SE
AU: Nielsen HV
AU: Thim T
AU: Lofgren B
TI: E-learning in pediatric basic life support: A randomized controlled non-inferiority study.
SO: Resuscitation
YR: 2015
VL: 90
PG: 7-12
XR: EMBASE 2015935529
PT: Journal: Article
DOI: 10.1016/j.resuscitation.2015.01.030
US: http://onlinelibrary.wiley.com/o/cochrane/clcentral/articles/265/CN-01070265/frame.html
Record #3 of 306
ID: CN-00982835
AU: Worm BS
AU: Jensen K
TI: Does peer learning or higher levels of e-learning improve learning abilities?
SO: Medical education online
YR: 2013
VL: 18
NO: 1
PG: 21877
PM: PUBMED 28166018
XR: EMBASE 24229729
PT: Journal Article; Randomized Controlled Trial
DOI: 10.3402/meo.v18i0.21877
US: http://onlinelibrary.wiley.com/o/cochrane/clcentral/articles/835/CN-00982835/frame.html
The code is posted below:
import re

# Function to process single record
def read_record(infile):
    line = infile.readline()
    line = line.strip()
    if not line:
        # End of file
        return None
    if not line.startswith("Record"):
        raise TypeError("Not a proper file: %r" % line)
    # Read tags and fields
    tags = []
    fields = []
    while 1:
        line = infile.readline().rstrip()
        if line == "":
            # Reached the end of the record or end of the file
            break
        prog = re.compile("^([A-Z][A-Z0-9][A-Z]?): (.*)")
        match = prog.match(line)
        tag = match.groups()[0]
        field = match.groups()[1]
        tags.append(tag)
        fields.append(field)
    return [tags, fields]

# Function to loop through records
def read_records(input_file):
    records = []
    while 1:
        record = read_record(input_file)
        if record is None:
            break
        records.append(record)
    return records

infile = open("test.txt")
for record in read_records(infile):
    print(record)
Answer 0 (score: 1)
Learn to iterate over a file line by line using for line in infile:
There is no need to test for the end of the file with ""; the for loop handles that for you:
for line in infile:
    # remove trailing newlines, and truncate lines that
    # are all-whitespace down to just ''
    line = line.rstrip()
    if line:
        # there is something on this line
        pass
    else:
        # this is a blank line - but it is definitely NOT the end-of-file
        pass
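As a small illustration of why the question's parser stops early: readline() returns "\n" for a blank line and "" only at the end of the file, but once the line has been stripped both look identical. Assuming records are separated by two blank lines, the first blank line ends the record and the second one is then mistaken for the end of the file. The sketch below uses a hypothetical in-memory sample (io.StringIO) rather than the real input file.

```python
import io

# Hypothetical sample: a blank line, one field line, then end of file.
sample = io.StringIO("\nID: CN-01160769\n")

first = sample.readline()   # blank line: the newline character is still there
second = sample.readline()  # an ordinary line, newline included
third = sample.readline()   # end of file: a truly empty string

print(repr(first), repr(second), repr(third))

# After stripping, the blank line and the end of file can no longer be told apart.
print(first.strip() == third.strip())
```

So a readline()-based parser would need to compare the raw line with "" before stripping it, whereas the for loop sidesteps the problem entirely.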
Answer 1 (score: 0)
As @PaulMcG suggested, here is a solution that iterates over the file line by line.
import re

records = []
count_records = 0
count_newlines = 0
prog = re.compile("^([A-Z][A-Z0-9][A-Z]?): (.*)")
bom = re.compile("^\ufeff")

with open("test.ris") as infile:
    for line in infile:
        line = line.rstrip()
        # strip a leading byte order mark, if present
        if bom.match(line):
            line = re.sub("^\ufeff", "", line)
        if line:
            if line.startswith("Record"):
                # a "Record #n of m" header starts a new record
                print("START NEW RECORD")
                count_records += 1
                count_newlines = 0
                current_record = {}
                continue
            match = prog.match(line)
            if match is None:
                # not a "TAG: value" line; skip it
                continue
            tag, field = match.groups()
            if tag == "AU":
                # a record can list several authors, so collect them in a list
                if tag in current_record:
                    current_record[tag].append(field)
                else:
                    current_record[tag] = [field]
            else:
                current_record[tag] = field
        else:
            # count blank lines seen since the last record header
            count_newlines += 1
            # the second blank line closes the record; store it exactly once
            if count_newlines == 2 and count_records > 0:
                print("# of records: ", count_records)
                print("# of newlines: ", count_newlines)
                records.append(current_record)
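A possible variation, offered only as a sketch: instead of counting blank lines, treat each "Record" header as the boundary and emit the final record once the loop ends. That way the last record is kept even if the file does not end with two blank lines. The function name iter_records, the constant TAG_LINE, and the shortened sample IDs are hypothetical and used for illustration only.

```python
import io
import re

TAG_LINE = re.compile(r"^([A-Z][A-Z0-9][A-Z]?): (.*)")


def iter_records(lines):
    """Yield one dict per record; a 'Record' header line starts a new record."""
    current = None
    for line in lines:
        # drop trailing whitespace and a possible leading byte order mark
        line = line.rstrip().lstrip("\ufeff")
        if not line:
            continue  # blank lines are ignored; headers mark the boundaries
        if line.startswith("Record"):
            if current is not None:
                yield current  # the previous record is complete
            current = {}
            continue
        match = TAG_LINE.match(line)
        if match is None or current is None:
            continue  # skip anything that is not a tag/field pair in a record
        tag, field = match.groups()
        if tag == "AU":
            current.setdefault(tag, []).append(field)  # authors accumulate
        else:
            current[tag] = field
    if current is not None:
        yield current  # the last record has no header after it


# Hypothetical sample: two records, the second one not followed by blank lines.
sample = io.StringIO(
    "Record #1 of 2\nID: CN-1\nAU: Uedo N\nAU: Yao K\n\n\n"
    "Record #2 of 2\nID: CN-2\nAU: Worm BS\n"
)
records = list(iter_records(sample))
print(len(records))
```

With a real file, the same function could be called as list(iter_records(infile)) inside the with block. If the file is UTF-8 with a byte order mark, opening it with encoding="utf-8-sig" should remove the mark on read, which would make the manual "\ufeff" handling unnecessary.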