如何使用Python解析复杂的文本文件?

时间:2017-12-26 19:31:51

标签: python regex pandas parsing

我正在寻找一种将复杂文本文件解析为pandas DataFrame的简单方法。下面是一个示例文件,我希望解析后的结果和我当前的方法。

有没有办法让它更简洁/更快/更pythonic /更可读?

我也把这个问题放在Code Review上。

我最终写了blog article to explain this to beginners

以下是一个示例文件:

Sample text

A selection of students from Riverdale High and Hogwarts took part in a quiz. This is a record of their scores.

School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7

Grade = 2
Student number, Name
0, Angela
1, Tristan
2, Aurora

Student number, Score
0, 6
1, 3
2, 9

School = Hogwarts
Grade = 1
Student number, Name
0, Ginny
1, Luna

Student number, Score
0, 8
1, 7

Grade = 2
Student number, Name
0, Harry
1, Hermione

Student number, Score
0, 5
1, 10

Grade = 3
Student number, Name
0, Fred
1, George

Student number, Score
0, 0
1, 0

以下是解析后我想要的结果:

                                         Name  Score
School         Grade Student number                 
Hogwarts       1     0                  Ginny      8
                     1                   Luna      7
               2     0                  Harry      5
                     1               Hermione     10
               3     0                   Fred      0
                     1                 George      0
Riverdale High 1     0                 Phoebe      3
                     1                 Rachel      7
               2     0                 Angela      6
                     1                Tristan      3
                     2                 Aurora      9

以下是我目前解析它的方式:

import re
import pandas as pd


def parse(filepath):
    """
    Parse text at given filepath

    Parameters
    ----------
    filepath : str
        Filepath for file to be parsed

    Returns
    -------
    data : pd.DataFrame
        Parsed data

    """

    data = []
    with open(filepath, 'r') as file:
        line = file.readline()
        while line:
            reg_match = _RegExLib(line)

            if reg_match.school:
                school = reg_match.school.group(1)

            if reg_match.grade:
                grade = reg_match.grade.group(1)
                grade = int(grade)

            if reg_match.name_score:
                value_type = reg_match.name_score.group(1)
                line = file.readline()
                while line.strip():
                    number, value = line.strip().split(',')
                    value = value.strip()
                    dict_of_data = {
                        'School': school,
                        'Grade': grade,
                        'Student number': number,
                        value_type: value
                    }
                    data.append(dict_of_data)
                    line = file.readline()

            line = file.readline()

        data = pd.DataFrame(data)
        data.set_index(['School', 'Grade', 'Student number'], inplace=True)
        # consolidate df to remove nans
        data = data.groupby(level=data.index.names).first()
        # upgrade Score from float to integer
        data = data.apply(pd.to_numeric, errors='ignore')
    return data


class _RegExLib:
    """Set up regular expressions"""
    # use https://regexper.com to visualise these if required
    _reg_school = re.compile('School = (.*)\n')
    _reg_grade = re.compile('Grade = (.*)\n')
    _reg_name_score = re.compile('(Name|Score)')

    def __init__(self, line):
        # check whether line has a positive match with all of the regular expressions
        self.school = self._reg_school.match(line)
        self.grade = self._reg_grade.match(line)
        self.name_score = self._reg_name_score.search(line)


if __name__ == '__main__':
    filepath = 'sample.txt'
    data = parse(filepath)
    print(data)

4 个答案:

答案 0 :(得分:25)

更新2019(PEG解析器):

这个答案得到了很多关注,所以我觉得要增加另一种可能性,即解析选项。在这里,我们可以使用NodeVisitor解析器(例如parsimonious)与from parsimonious.grammar import Grammar from parsimonious.nodes import NodeVisitor import pandas as pd grammar = Grammar( r""" schools = (school_block / ws)+ school_block = school_header ws grade_block+ grade_block = grade_header ws name_header ws (number_name)+ ws score_header ws (number_score)+ ws? school_header = ~"^School = (.*)"m grade_header = ~"^Grade = (\d+)"m name_header = "Student number, Name" score_header = "Student number, Score" number_name = index comma name ws number_score = index comma score ws comma = ws? "," ws? index = number+ score = number+ number = ~"\d+" name = ~"[A-Z]\w+" ws = ~"\s*" """ ) tree = grammar.parse(data) class SchoolVisitor(NodeVisitor): output, names = ([], []) current_school, current_grade = None, None def _getName(self, idx): for index, name in self.names: if index == idx: return name def generic_visit(self, node, visited_children): return node.text or visited_children def visit_school_header(self, node, children): self.current_school = node.match.group(1) def visit_grade_header(self, node, children): self.current_grade = node.match.group(1) self.names = [] def visit_number_name(self, node, children): index, name = None, None for child in node.children: if child.expr.name == 'name': name = child.text elif child.expr.name == 'index': index = child.text self.names.append((index, name)) def visit_number_score(self, node, children): index, score = None, None for child in node.children: if child.expr.name == 'index': index = child.text elif child.expr.name == 'score': score = child.text name = self._getName(index) # build the entire entry entry = (self.current_school, self.current_grade, index, name, score) self.output.append(entry) sv = SchoolVisitor() sv.visit(tree) df = pd.DataFrame.from_records(sv.output, columns = ['School', 'Grade', 'Student number', 'Name', 'Score']) print(df) 类结合使用:

^
School\s*=\s*(?P<school_name>.+)
(?P<school_content>[\s\S]+?)
(?=^School|\Z)

正则表达式选项(原始答案)

那么,看第五次指环王,我不得不把时间缩短到最后的结局:

<小时/> 细分,想法是将问题分解为几个较小的问题:

  1. 将每所学校分开
  2. ...每个年级
  3. ...学生和分数
  4. ...之后在数据框中将它们绑定在一起
  5. <小时/> 学校部分(见a demo on regex101.com

    ^
    Grade\s*=\s*(?P<grade>.+)
    (?P<students>[\s\S]+?)
    (?=^Grade|\Z)
    

    <小时/> 成绩部分(another demo on regex101.com

    ^
    Student\ number,\ Name[\n\r]
    (?P<student_names>(?:^\d+.+[\n\r])+)
    \s*
    ^
    Student\ number,\ Score[\n\r]
    (?P<student_scores>(?:^\d+.+[\n\r])+)
    

    <小时/> 学生/分数部分(last demo on regex101.com):

    DataFrame

    其余的是一个生成器表达式,然后将其输入import pandas as pd, re rx_school = re.compile(r''' ^ School\s*=\s*(?P<school_name>.+) (?P<school_content>[\s\S]+?) (?=^School|\Z) ''', re.MULTILINE | re.VERBOSE) rx_grade = re.compile(r''' ^ Grade\s*=\s*(?P<grade>.+) (?P<students>[\s\S]+?) (?=^Grade|\Z) ''', re.MULTILINE | re.VERBOSE) rx_student_score = re.compile(r''' ^ Student\ number,\ Name[\n\r] (?P<student_names>(?:^\d+.+[\n\r])+) \s* ^ Student\ number,\ Score[\n\r] (?P<student_scores>(?:^\d+.+[\n\r])+) ''', re.MULTILINE | re.VERBOSE) result = ((school.group('school_name'), grade.group('grade'), student_number, name, score) for school in rx_school.finditer(string) for grade in rx_grade.finditer(school.group('school_content')) for student_score in rx_student_score.finditer(grade.group('students')) for student in zip(student_score.group('student_names')[:-1].split("\n"), student_score.group('student_scores')[:-1].split("\n")) for student_number in [student[0].split(", ")[0]] for name in [student[0].split(", ")[1]] for score in [student[1].split(", ")[1]] ) df = pd.DataFrame(result, columns = ['School', 'Grade', 'Student number', 'Name', 'Score']) print(df) 构造函数(以及列名)。

    <小时/> 代码:

    rx_school = re.compile(r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)', re.MULTILINE)
    rx_grade = re.compile(r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)', re.MULTILINE)
    rx_student_score = re.compile(r'^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)', re.MULTILINE)
    

    <小时/> 冷凝:

                School Grade Student number      Name Score
    0   Riverdale High     1              0    Phoebe     3
    1   Riverdale High     1              1    Rachel     7
    2   Riverdale High     2              0    Angela     6
    3   Riverdale High     2              1   Tristan     3
    4   Riverdale High     2              2    Aurora     9
    5         Hogwarts     1              0     Ginny     8
    6         Hogwarts     1              1      Luna     7
    7         Hogwarts     2              0     Harry     5
    8         Hogwarts     2              1  Hermione    10
    9         Hogwarts     3              0      Fred     0
    10        Hogwarts     3              1    George     0
    

    <小时/> 这产生了

    import timeit
    print(timeit.timeit(makedf, number=10**4))
    # 11.918397722000009 s
    

    <小时/> 至于计时,这是运行它一万次的结果:

    {{1}}

答案 1 :(得分:7)

这是我的建议使用split和pd.concat(“txt”代表问题中原始文本的副本), 基本上,这个想法是按组词分割然后连接成数据框,最内部的解析利用了名称和等级是csv格式的事实。 在这里:

import pandas as pd
from io import StringIO

schools = txt.lower().split('school = ')
schools_dfs = []
for school in schools[1:]:
    grades = school.split('grade = ') 
    grades_dfs = []
    for grade in grades[1:]:
        features = grade.split('student number,')
        feature_dfs = []
        for feature in features[1:]:
            feature_dfs.append(pd.read_csv(StringIO(feature)))
        feature_df = pd.concat(feature_dfs, axis=1)
        feature_df['grade'] = features[0].replace('\n','')
        grades_dfs.append(feature_df)
    grades_df = pd.concat(grades_dfs)
    grades_df['school'] = grades[0].replace('\n','')
    schools_dfs.append(grades_df)
schools_df = pd.concat(schools_dfs)

schools_df.set_index(['school', 'grade'])

enter image description here

答案 2 :(得分:6)

我建议使用像parsy这样的解析器组合库。与使用正则表达式相比,结果不会那么简洁,但它会更加可读和健壮,同时仍然相对轻量级。

解析通常是一项非常艰巨的任务,并且很难找到一种对普通编程初学者有用的方法。

编辑: 一些实际的示例代码,对您提供的示例进行最小的解析。它不会传递给大熊猫,甚至不会将名称与分数相匹配,或者将学生与分数相匹配等等 - 它只返回顶部以School开头的对象层次结构,并具有您期望的相关属性: / p>

from parsy import string, regex, seq
import attr


@attr.s
class Student():
    name = attr.ib()
    number = attr.ib()


@attr.s
class Score():
    score = attr.ib()
    number = attr.ib()


@attr.s
class Grade():
    grade = attr.ib()
    students = attr.ib()
    scores = attr.ib()


@attr.s
class School():
    name = attr.ib()
    grades = attr.ib()


integer = regex(r"\d+").map(int)
student_number = integer
score = integer
student_name = regex(r"[^\n]+")
student_def = seq(student_number.tag('number') << string(", "),
                  student_name.tag('name') << string("\n")).combine_dict(Student)
student_def_list = string("Student number, Name\n") >> student_def.many()
score_def = seq(student_number.tag('number') << string(", "),
                score.tag('score') << string("\n")).combine_dict(Score)
score_def_list = string("Student number, Score\n") >> score_def.many()
grade_value = integer
grade_def = string("Grade = ") >> grade_value << string("\n")
school_grade = seq(grade_def.tag('grade'),
                   student_def_list.tag('students') << regex(r"\n*"),
                   score_def_list.tag('scores') << regex(r"\n*")
                   ).combine_dict(Grade)

school_name = regex(r"[^\n]+")
school_def = string("School = ") >> school_name << string("\n")
school = seq(school_def.tag('name'),
             school_grade.many().tag('grades')
             ).combine_dict(School)


def parse(text):
    return school.many().parse(text)

这比正则表达式解决方案更冗长,但更接近于文件格式的声明性定义。

答案 3 :(得分:3)

以与原始代码类似的方式定义解析正则表达式

import re
import pandas as pd

parse_re = {
    'school': re.compile(r'School = (?P<school>.*)$'),
    'grade': re.compile(r'Grade = (?P<grade>\d+)'),
    'student': re.compile(r'Student number, (?P<info>\w+)'),
    'data': re.compile(r'(?P<number>\d+), (?P<value>.*)$'),
}

def parse(line):
    '''parse the line by regex search against possible line formats
       returning the id and match result of first matching regex,
       or None if no match is found'''
    return reduce(lambda (i,m),(id,rx): (i,m) if m else (id, rx.search(line)), 
                  parse_re.items(), (None,None))

然后循环收集有关每个学生的信息。记录完成后(当记录完成Score时),我们将记录追加到列表中。

由逐行正则表达式驱动的小型状态机匹配每个记录。特别是我们必须将学生按成绩保存,因为他们的分数和名称在输入文件中单独提供。

results = []
with open('sample.txt') as f:
    record = {}
    for line in f:
        id, match = parse(line)

        if match is None:
            continue

        if id == 'school':
            record['School'] = match.group('school')
        elif id == 'grade':
            record['Grade'] = int(match.group('grade'))
            names = {}  # names is a number indexed dictionary of student names
        elif id == 'student':
            info = match.group('info')
        elif id == 'data':
            number = int(match.group('number'))
            value = match.group('value')
            if info == 'Name':
                names[number] = value
            elif info == 'Score':
                record['Student number'] = number
                record['Name'] = names[number]
                record['Score'] = int(value)
                results.append(record.copy())

最后,记录列表将转换为DataFrame

df = pd.DataFrame(results, columns=['School', 'Grade', 'Student number', 'Name', 'Score'])
print df

输出:

            School  Grade  Student number      Name  Score
0   Riverdale High      1               0    Phoebe      3
1   Riverdale High      1               1    Rachel      7
2   Riverdale High      2               0    Angela      6
3   Riverdale High      2               1   Tristan      3
4   Riverdale High      2               2    Aurora      9
5         Hogwarts      1               0     Ginny      8
6         Hogwarts      1               1      Luna      7
7         Hogwarts      2               0     Harry      5
8         Hogwarts      2               1  Hermione     10
9         Hogwarts      3               0      Fred      0
10        Hogwarts      3               1    George      0

一些优化是首先比较最常见的正则表达式并明确跳过空白行。我们去的时候构建数据帧会避免额外的数据副本,但我认为附加到数据帧是一项昂贵的操作。