Question

I really don't believe in a generic text file parser anymore, especially for files that are meant for human readers. Files like HTML and web logs can be handled well by Beautiful Soup or regular expressions. But the human-readable text file is still a tough nut to crack.

I am willing to hand-code a text file parser, tailoring it to every different format I encounter. But I still want to see whether a better program structure is possible, so that the code stays readable and I can still understand the program logic 3 months down the road.

Today I was given a problem to extract the time-stamps from a file:

"As of 12:30:45, ..."
"Between 1:12:00 and 3:10:45, ..."
"During this time from 3:44:50 to 4:20:55 we have ..."

The parsing is straightforward. The time-stamps are in different locations on each line. But I am wondering how I should design the module/functions so that (1) each line format is handled separately, and (2) there is a clean way to branch to the relevant function. For example, I can code each line parser like this:

def parse_as(s):
    ts = s.split(' ')[2].rstrip(',')  # drop the trailing comma
    return ts, ts  # only one time stamp found, so return it twice

def parse_between(s):
    words = s.split(' ')
    return words[1], words[3].rstrip(',')  # "Between <ts1> and <ts2>, ..."

def parse_during(s):
    words = s.split(' ')
    return words[4], words[6]

This gives me a quick overview of the formats the program already handles. I can always add a new function when I encounter another format.

However, I still don't have an elegant way to branch to the relevant function.

# open file
for l in f:
    s = l.split(' ')[0]  # the first word identifies the line format
    if s == 'As':
        ts1, ts2 = parse_as(l)
    else:
        if s == 'Between':
            ts1, ts2 = parse_between(l)
        else:
            if s == 'During':
                ts1, ts2 = parse_during(l)
            else:
                print('error!')
    # process ts1 and ts2

That's not something I want to maintain.

Any suggestions? I once thought a decorator might help, but I couldn't work it out myself. I would appreciate it if anyone could point me in the right direction.

Answer 1

Consider using a dictionary mapping:

dmap = {
    'As': parse_as,
    'Between': parse_between,
    'During': parse_during
}

Then you only need to use it like this:

for l in f:
    s = l.split(' ')[0]  # the first word of the line is the lookup key
    p = dmap.get(s)
    if p is None:
        print('error')
    else:
        ts1, ts2 = p(l)
        #continue to process

This is a lot easier to maintain. If you have a new function, you just need to add it to dmap together with its keyword:

dmap = {
    'As': parse_as,
    'Between': parse_between,
    'During': parse_during,
    'After': parse_after,
    'Before': parse_before
    #and so on
}
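Since the question mentions decorators, here is a minimal sketch of how the dictionary could be filled in by a registering decorator, so that each parser declares its own keyword next to its definition. The names PARSERS, handles and parse_line are hypothetical, and stripping the trailing comma with rstrip(',') is an assumption based on the three sample lines.

```python
# Hypothetical registry: maps the first word of a line to its parser.
PARSERS = {}

def handles(keyword):
    """Register the decorated function as the parser for `keyword`."""
    def register(func):
        PARSERS[keyword] = func
        return func  # leave the function itself unchanged
    return register

@handles('As')
def parse_as(line):
    ts = line.split()[2].rstrip(',')
    return ts, ts  # only one time stamp on this kind of line

@handles('Between')
def parse_between(line):
    words = line.split()
    return words[1], words[3].rstrip(',')

@handles('During')
def parse_during(line):
    words = line.split()
    return words[4], words[6]

def parse_line(line):
    """Dispatch on the first word; raise ValueError for unknown formats."""
    words = line.split()
    parser = PARSERS.get(words[0]) if words else None
    if parser is None:
        raise ValueError('unrecognised line format: %r' % line)
    return parser(line)
```

With this layout, supporting a new format means adding one decorated function; the dispatch code in parse_line does not change.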

Answer 2

What about matching on the start of each line?

start_with = ["As", "Between", "During"]
parsers = [parse_as, parse_between, parse_during]


for l in f.readlines():
    match_found = False

    for start, parser in zip(start_with, parsers):  # avoid shadowing the file handle f
        if l.startswith(start):
            ts1, ts2 = parser(l)  # the parsers expect the whole line, not a list
            match_found = True
            break

    if not match_found:
        raise NotImplementedError('Not found!')

Or with a dict, as Ian mentioned:

rules = {
    "As": parse_as,
    "Between": parse_between,
    "During": parse_during
}

for l in f.readlines():
    match_found = False

    for start, parser in rules.items():  # avoid shadowing the file handle f
        if l.startswith(start):
            ts1, ts2 = parser(l)  # the parsers expect the whole line, not a list
            match_found = True
            break

    if not match_found:
        raise NotImplementedError('Not found!')
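The match_found flag can be avoided by moving the prefix search into a small helper that returns as soon as a rule matches. This is only a sketch: find_parser and RULES are hypothetical names, and the three parsers are repeated here (with the comma stripping assumed from the sample lines) just to keep the example self-contained.

```python
def parse_as(line):
    ts = line.split()[2].rstrip(',')
    return ts, ts  # single time stamp, returned twice

def parse_between(line):
    words = line.split()
    return words[1], words[3].rstrip(',')

def parse_during(line):
    words = line.split()
    return words[4], words[6]

# Ordered (prefix, parser) pairs; the first matching prefix wins.
RULES = [
    ('As', parse_as),
    ('Between', parse_between),
    ('During', parse_during),
]

def find_parser(line):
    """Return the parser whose prefix starts the line."""
    for prefix, parser in RULES:
        if line.startswith(prefix):
            return parser
    raise NotImplementedError('Not found: %r' % line)
```

The main loop then reduces to ts1, ts2 = find_parser(l)(l) for each line.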

Answer 3

Why not use a regular expression?

import re

# open file
with open('datafile.txt') as f:
    for line in f:
        ts_vals = re.findall(r'(\d+:\d\d:\d\d)', line)
        # process ts1 and ts2

So ts_vals will be a list containing one or two elements for the examples provided.
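To get the same (ts1, ts2) pair the question asks for, the list can be normalised afterwards. A minimal sketch, assuming that a line with a single time stamp should return it twice and that lines without any time stamp are errors; TIMESTAMP and extract_range are hypothetical names.

```python
import re

# Same pattern as above: hours, then two-digit minutes and seconds.
TIMESTAMP = re.compile(r'\d+:\d\d:\d\d')

def extract_range(line):
    """Return (ts1, ts2) taken from the first and last time stamp found."""
    found = TIMESTAMP.findall(line)
    if not found:
        raise ValueError('no time stamp in line: %r' % line)
    return found[0], found[-1]  # identical when only one was found
```

This handles all three sample formats without a per-format function, at the cost of no longer checking which format a line uses.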

Design a module to parse text file

3 answers: