How to read a file correctly in Python

Time: 2017-10-13 16:36:01

Tags: python parsing

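This is a follow-up to my earlier question, in which I was struggling to parse a RIS file. I have since combined some code into a new parser that reads a record correctly. Unfortunately, the code stops after the first record, and I don't know how to distinguish the end of the file from the two blank lines that separate records. Any ideas?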

The input file is provided here:

Record #1 of 306
ID: CN-01160769
AU: Uedo N
AU: Yao K
AU: Muto M
AU: Ishikawa H
TI: Development of an E-learning system.
SO: United European Gastroenterology Journal
YR: 2015
VL: 3
NO: 5 SUPPL. 1
PG: A490
XR: EMBASE 72267184
PT: Journal: Conference Abstract
DOI: 10.1177/2050640615601623
US: http://onlinelibrary.wiley.com/o/cochrane/clcentral/articles/769/CN-01160769/frame.html


Record #2 of 306
ID: CN-01070265
AU: Krogh LQ
AU: Bjornshave K
AU: Vestergaard LD
AU: Sharma MB
AU: Rasmussen SE
AU: Nielsen HV
AU: Thim T
AU: Lofgren B
TI: E-learning in pediatric basic life support: A randomized controlled non-inferiority study.
SO: Resuscitation
YR: 2015
VL: 90
PG: 7-12
XR: EMBASE 2015935529
PT: Journal: Article
DOI: 10.1016/j.resuscitation.2015.01.030
US: http://onlinelibrary.wiley.com/o/cochrane/clcentral/articles/265/CN-01070265/frame.html


Record #3 of 306
ID: CN-00982835
AU: Worm BS
AU: Jensen K
TI: Does peer learning or higher levels of e-learning improve learning abilities?
SO: Medical education online
YR: 2013
VL: 18
NO: 1
PG: 21877
PM: PUBMED 28166018
XR: EMBASE 24229729
PT: Journal Article; Randomized Controlled Trial
DOI: 10.3402/meo.v18i0.21877
US: http://onlinelibrary.wiley.com/o/cochrane/clcentral/articles/835/CN-00982835/frame.html

The code is pasted below:

import re

# Function to process single record
def read_record(infile):
    line = infile.readline()
    line = line.strip()

    if not line:
        # End of file
        return None

    if not line.startswith("Record"):
        raise TypeError("Not a proper file: %r" % line)

    # Read tags and fields
    tags = []
    fields = []
    while 1:
        line = infile.readline().rstrip()
        if line == "":
            # Reached the end of the record or end of the file
            break
        prog = re.compile("^([A-Z][A-Z0-9][A-Z]?): (.*)")
        match = prog.match(line)
        tag = match.groups()[0]
        field = match.groups()[1]
        tags.append(tag)
        fields.append(field)

    return [tags, fields]


# Function to loop through records
def read_records(input_file):
    records = []
    while 1:
        record = read_record(input_file)
        if record is None:
            break
        records.append(record)
    return records


infile = open("test.txt")

for record in read_records(infile):
  print(record)

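2 answers:

Answer 0 (score: 1)

Learn to iterate over a file line by line with for line in infile:. There is no need to test for end of file by comparing against ""; the for loop does that for you: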

for line in infile:
    # remove trailing newlines, and truncate lines that
    # are all-whitespace down to just ''
    line = line.rstrip()

    if line:
        # there is something on this line
        pass
    else:
        # this is a blank line - but it is definitely NOT the end-of-file
        pass
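
Some context on why the parser in the question stops after the first record: readline() returns "\n" for a blank line and "" only at end of file, but once the line is stripped the two cases look identical. The following is a minimal sketch of that behaviour, using io.StringIO as a stand-in for the real file; the sample text is made up for illustration.

```python
import io

# In-memory stand-in for a file: two short records separated by two blank lines
sample = io.StringIO("Record #1\nID: A\n\n\nRecord #2\nID: B\n")

raw_lines = []
while True:
    raw = sample.readline()
    if raw == "":
        # readline() returns an empty string only at end of file
        break
    raw_lines.append(raw)

# Blank lines arrive as "\n", which is truthy; they only become "" after stripping
blank_count = sum(1 for raw in raw_lines if raw.strip() == "")
print(raw_lines)
print(blank_count)
```

So a readline-based parser has to test for end of file before stripping the line, while a for loop avoids the question because it simply ends when the file does.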

Answer 1 (score: 0)

As @PaulMcG suggested, here is a solution that iterates over the file line by line.

import re

records = []
count_records = 0
count_newlines = 0
current_record = None
prog = re.compile(r"^([A-Z][A-Z0-9][A-Z]?): (.*)")
with open("test.ris") as infile:
    for line in infile:
        # strip the trailing newline and a leading byte order mark, if present
        line = line.rstrip().lstrip("\ufeff")
        if line:
            if line.startswith("Record"):
                print("START NEW RECORD")
                count_records += 1
                count_newlines = 0
                current_record = {}
                continue
            match = prog.match(line)
            if match is None:
                raise ValueError("Unexpected line: %r" % line)
            tag, field = match.groups()
            if tag == "AU":
                # a record can list several authors, so collect them in a list
                current_record.setdefault(tag, []).append(field)
            else:
                current_record[tag] = field
        else:
            count_newlines += 1
            # the second blank line since the record header closes the record
            if count_newlines == 2 and current_record is not None:
                print("# of records: ", count_records)
                print("# of newlines: ", count_newlines)
                records.append(current_record)
                current_record = None

# the last record may not be followed by two blank lines, so store it here
if current_record is not None:
    records.append(current_record)
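
The same idea can also be packaged as a generator that treats each "Record" header, rather than a count of blank lines, as the record boundary. This is only a sketch under assumptions: iter_records and TAG_LINE are hypothetical names, the sample IDs are made up, and the input is assumed to follow the layout shown in the question.

```python
import io
import re

# Same tag pattern as in the answer: two or three characters, ": ", then the value
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9][A-Z]?): (.*)")


def iter_records(lines):
    """Yield one dict per record from an iterable of text lines."""
    current = None
    for line in lines:
        line = line.rstrip().lstrip("\ufeff")
        if not line:
            # Blank lines only separate records, so skip them
            continue
        if line.startswith("Record"):
            # A new header closes the previous record, if there is one
            if current is not None:
                yield current
            current = {}
            continue
        match = TAG_LINE.match(line)
        if match is None or current is None:
            raise ValueError("Unexpected line: %r" % line)
        tag, field = match.groups()
        if tag == "AU":
            # Authors can repeat, so they are collected in a list
            current.setdefault(tag, []).append(field)
        else:
            current[tag] = field
    # The for loop ends at end of file; emit the record still being built
    if current is not None:
        yield current


# Made-up sample in the same layout as the input file
sample = io.StringIO(
    "Record #1 of 2\nID: CN-1\nAU: Uedo N\nAU: Yao K\nYR: 2015\n"
    "\n\n"
    "Record #2 of 2\nID: CN-2\nAU: Worm BS\nYR: 2013\n"
)
parsed = list(iter_records(sample))
print(len(parsed))
print(parsed[0]["AU"])
```

Because the function accepts any iterable of lines, it can be given an open file object directly, and the last record is emitted even when the file does not end with blank lines.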