Parsing (modified) RIS files with Python

Time: 2017-10-13 09:00:10

Tags: python parsing

I have a bunch of (modified) RIS files. A toy example looks like this:

Record #1 of 2
ID: CN-01160769
AU: Uedo N
AU: Kasiser R
TI: Development of an E-learning system
SO: United European Gastroenterology Journal
YR: 2015


Record #2 of 2
ID: CN-01070265
AU: Krogh LQ
TI: E-learning in pediatric basic life support
SO: Resuscitation
YR: 2015

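In short, each record starts with a Record # line and ends with two blank lines. The task is to parse the file and extract the tags and their fields.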

My current code is pasted below (adapted from here):

import re

class RIS:
    """ RIS file structure """
    def __init__(self, in_file=None):
        """ Initialize and parse input """
        self.records = []
        if in_file:
            self.parse(in_file)

    def parse(self, in_file):
        """ Parse input file """
        self.current_tag = None
        self.current_record = None
        prog = re.compile("^([A-Z][A-Z0-9]): (.*)")
        lines = []
        # Eliminate blank lines
        for line in in_file:
            line = line.strip()
            if len(line) > 0:
                lines.append(line)
        for line in lines:
            match = prog.match(line)
            if match:
                tag = match.groups()[0]
                field = match.groups()[1]
                self.process_field(tag, field)
            else:
                raise ValueError(line)

    def process_field(self, tag, field):
        """ Process RIS file field """
        if tag == "ID":
            self.current_record = {tag: field}
        elif tag == "YR":
            self.records.append(self.current_record)
            self.current_record = None
        elif tag in ["AU", "AD"]:
            if tag in self.current_record:
                self.current_record[tag].append(field)
            else:
                self.current_record[tag] = [field]
        else:
            if not tag in self.current_record:
                self.current_record[tag] = field
            else:
                error_str = "Duplicate tag: %s" % tag
                raise ValueError(error_str)

def main():
    """ Test the code """
    import pprint
    with open("test.ris", "rt") as ris_file:
        ris = RIS(ris_file)
        pp = pprint.PrettyPrinter()
        pp.pprint(ris.records)

if __name__ == "__main__":
    main()

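The current code does not work because it does not recognize the start marker (e.g. Record #1 of 2), and it also does not know where a record ends. In the current version of the code, I use ID as the start tag and YR as the stop tag. However, the code exits with an error: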

ValueError: Record #1 of 2

Any suggestions on how to adapt the code properly are very welcome.

1 answer:

Answer 0 (score: 1)

You only need to add a check that skips the Record #x of 2 lines.

import re

class RIS:
    """ RIS file structure """
    def __init__(self, in_file=None):
        """ Initialize and parse input """
        self.records = []
        if in_file:
            self.parse(in_file)

    def parse(self, in_file):
        """ Parse input file """
        self.current_tag = None
        self.current_record = None
        prog = re.compile("^([A-Z][A-Z0-9]): (.*)")
        lines = []
        # Eliminate blank lines
        for line in in_file:
            line = line.strip()
            if len(line) > 0:
                lines.append(line)
        for line in lines:
            # Skip the "Record #x of y" header lines; they carry no tag
            if "#" in line:
                continue
            match = prog.match(line)
            if match:
                tag = match.groups()[0]
                field = match.groups()[1]
                self.process_field(tag, field)
            else:
                raise ValueError(line)

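    def process_field(self, tag, field):
        """ Process RIS file field """
        if tag == "ID":
            # ID is treated as the start of a new record
            self.current_record = {tag: field}
        elif tag == "YR":
            # YR is treated as the end of a record; the year itself is not stored
            self.records.append(self.current_record)
            self.current_record = None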
        elif tag in ["AU", "AD"]:
            if tag in self.current_record:
                self.current_record[tag].append(field)
            else:
                self.current_record[tag] = [field]
        else:
            if tag not in self.current_record:
                self.current_record[tag] = field
            else:
                error_str = "Duplicate tag: %s" % tag
                raise ValueError(error_str)

def main():
    """ Test the code """
    import pprint
    with open("test.ris", "rt") as ris_file:
        ris = RIS(ris_file)
        pp = pprint.PrettyPrinter()
        pp.pprint(ris.records)

if __name__ == "__main__":
    main()

The added code:

if "#" in line:
    continue
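
One caveat with the "#" in line test: it also skips any ordinary field whose text happens to contain a #, for example a title that mentions C#. A stricter check, sketched below, only matches lines shaped like the header in the toy example. The helper name is_record_header and the HEADER pattern are my own assumptions for illustration, not part of the original code.

```python
import re

# Assumed header shape, taken from the toy example: "Record #1 of 2"
HEADER = re.compile(r"^Record #(\d+) of (\d+)$")

def is_record_header(line):
    """Return True only for 'Record #x of y' header lines (hypothetical helper)."""
    return HEADER.match(line.strip()) is not None

print(is_record_header("Record #1 of 2"))        # a header line
print(is_record_header("TI: C# for beginners"))  # a normal field containing '#'
```

With such a helper, the condition in parse would read if is_record_header(line): continue, and field lines containing a # would still reach process_field.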

Output:

[{'AU': ['Uedo N', 'Kasiser R'],
  'ID': 'CN-01160769',
  'SO': 'United European Gastroenterology Journal',
  'TI': 'Development of an E-learning system'},
 {'AU': ['Krogh LQ'],
  'ID': 'CN-01070265',
  'SO': 'Resuscitation',
  'TI': 'E-learning in pediatric basic life support'}]
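
Note that the output has no YR key: process_field uses YR purely as an end-of-record marker and never stores the value. If the year matters, one alternative is to let the Record # header open each record, so that no field has to double as a stop code. The following is only a minimal sketch under that assumption; parse_records, HEADER, FIELD and MULTI are names made up for illustration, and the header pattern is inferred from the toy example.

```python
import re

HEADER = re.compile(r"^Record #\d+ of \d+$")    # assumed header format
FIELD = re.compile(r"^([A-Z][A-Z0-9]): (.*)$")  # same tag pattern as the original code
MULTI = {"AU", "AD"}                            # tags that may repeat within a record

def parse_records(lines):
    """Return a list of dicts, one per 'Record #x of y' block."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                  # blank lines carry no data
        if HEADER.match(line):
            current = {}              # a header opens a new record
            records.append(current)
            continue
        match = FIELD.match(line)
        if match is None or current is None:
            raise ValueError(line)    # unknown line, or a field before any header
        tag, value = match.groups()
        if tag in MULTI:
            current.setdefault(tag, []).append(value)
        elif tag in current:
            raise ValueError("Duplicate tag: %s" % tag)
        else:
            current[tag] = value
    return records

sample = """\
Record #1 of 2
ID: CN-01160769
AU: Uedo N
AU: Kasiser R
TI: Development of an E-learning system
SO: United European Gastroenterology Journal
YR: 2015


Record #2 of 2
ID: CN-01070265
AU: Krogh LQ
TI: E-learning in pediatric basic life support
SO: Resuscitation
YR: 2015
"""

records = parse_records(sample.splitlines())
print(len(records), records[0]["YR"], records[1]["AU"])
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, just as with the RIS class above.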