Elegant structured text file parsing

Date: 2008-10-21 23:00:21

Tags: python ruby perl text-parsing

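I need to parse a transcript of a live chat conversation. My first thought on seeing the file was to throw regular expressions at the problem, but I was wondering what other approaches people have used.

I put "elegant" in the title because I have previously found that this type of task risks becoming hard to maintain when it relies on regular expressions alone.

The transcripts are generated by www.providesupport.com and emailed to an account; I then extract the plain-text transcript attachment from the email.

The reason for parsing the file is to extract the conversation text for later use, and also to identify the visitor and operator names so that the information can be made available via a CRM.

Here is an example of a transcript file: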

Chat Transcript

Visitor: Random Website Visitor 
Operator: Milton
Company: Initech
Started: 16 Oct 2008 9:13:58
Finished: 16 Oct 2008 9:45:44

Random Website Visitor: Where do i get the cover sheet for the TPS report?
* There are no operators available at the moment. If you would like to leave a message, please type it in the input field below and click "Send" button
* Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.
Milton: Y-- Excuse me. You-- I believe you have my stapler?
Random Website Visitor: I really just need the cover sheet, okay?
Milton: it's not okay because if they take my stapler then I'll, I'll, I'll set the building on fire...
Random Website Visitor: oh i found it, thanks anyway.
* Random Website Visitor is now off-line and may not reply. Currently in room: Milton.
Milton: Well, Ok. But… that's the last straw.
* Milton has left the conversation. Currently in room:  room is empty.

Visitor Details
---------------
Your Name: Random Website Visitor
Your Question: Where do i get the cover sheet for the TPS report?
IP Address: 255.255.255.255
Host Name: 255.255.255.255
Referrer: Unknown
Browser/OS: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)

9 answers:

Answer 0 (score: 12)

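No. In fact, for the specific type of task you describe, I doubt there is a "cleaner" way to do it than regular expressions. It looks like your files have embedded line breaks, so what we typically do is make the line the unit of decomposition and apply per-line regexes. Meanwhile, you create a small state machine and use regex matches to trigger transitions in it. That way you know where you are in the file and what kind of character data to expect. Also consider using named capture groups and loading the regexes from an external file. Then, if the format of your transcript changes, it is a simple matter of tweaking a regex rather than writing new parsing code.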
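The approach described above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the pattern names, the state names and the `parse_transcript` function are all assumptions made for illustration, and the sample is a shortened version of the transcript from the question. The patterns are inlined to keep the example self-contained, although they could be loaded from an external file as suggested.

```python
import re

# Hypothetical per-line patterns with named capture groups. They are inlined
# here, but could just as well be loaded from an external file.
PATTERNS = {
    "field": re.compile(r"^(?P<name>[^:]+):\s*(?P<value>.*)$"),
    "server": re.compile(r"^\* (?P<text>.*)$"),
    "footer_start": re.compile(r"^Visitor Details$"),
}


def parse_transcript(text):
    """Parse one transcript into (header, messages, footer)."""
    state = "start"
    header, messages, footer = {}, [], {}
    for raw in text.splitlines():
        line = raw.rstrip()
        field = PATTERNS["field"].match(line)
        if state == "start":
            if line == "Chat Transcript":
                state = "header"
        elif state == "header":
            if field:
                header[field.group("name")] = field.group("value")
            elif header and not line:
                state = "conversation"  # blank line ends the header
        elif state == "conversation":
            server = PATTERNS["server"].match(line)
            if PATTERNS["footer_start"].match(line):
                state = "footer"
            elif server:
                messages.append(("Server", server.group("text")))
            elif field:
                messages.append((field.group("name"), field.group("value")))
        elif state == "footer" and field:
            footer[field.group("name")] = field.group("value")
    return header, messages, footer


# Shortened copy of the transcript from the question.
SAMPLE = """\
Chat Transcript

Visitor: Random Website Visitor
Operator: Milton

Random Website Visitor: I really just need the cover sheet, okay?
* Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.
Milton: Y-- Excuse me. You-- I believe you have my stapler?

Visitor Details
---------------
Referrer: Unknown
"""

header, messages, footer = parse_transcript(SAMPLE)
print(header)
print(messages)
print(footer)
```

Because the current state decides how a `name: value` line is interpreted, the same `field` pattern serves the header, the conversation and the footer. The sketch does not handle a chat message that spans several lines; that would need an extra rule appending unmatched lines to the previous message.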

Answer 1 (score: 11)

With Perl, you can use Parse::RecDescent.

It is simple, and your grammar will remain maintainable later on.

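Answer 2 (score: 6)

You might want to consider a full parser generator.

Regular expressions are good for searching text for small substrings, but they are woefully underpowered if you are really interested in parsing an entire file into meaningful data.

They are especially insufficient if the context of the substrings matters.

Most people throw regexes at everything because that is all they know. They have never learned any parser-generating tools, so they end up hand-coding a lot of the production rule composition and semantic action handling that a parser generator gives you for free.

Regexes are great and all, but if you need a parser, they are no substitute.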

Answer 3 (score: 6)

Here are two parsers based on the lepl parser generator library. They both produce the same result.

from pprint import pprint
from lepl import AnyBut, Drop, Eos, Newline, Separator, SkipTo, Space

# field = name , ":" , value
name, value = AnyBut(':\n')[1:,...], AnyBut('\n')[::'n',...]    
with Separator(~Space()[:]):
    field = name & Drop(':') & value & ~(Newline() | Eos()) > tuple

header_start   = SkipTo('Chat Transcript' & Newline()[2])
header         = ~header_start & field[1:] > dict
server_message = Drop('* ') & AnyBut('\n')[:,...] & ~Newline() > 'Server'
conversation   = (server_message | field)[1:] > list
footer_start   = 'Visitor Details' & Newline() & '-'*15 & Newline()
footer         = ~footer_start & field[1:] > dict
chat_log       = header & ~Newline() & conversation & ~Newline() & footer

pprint(chat_log.parse_file(open('chat.log')))

Stricter Parser

from functools import reduce  # reduce is not a builtin on Python 3
from pprint import pprint
from lepl import And, Drop, Newline, Or, Regexp, SkipTo

def Field(name, value=Regexp(r'\s*(.*?)\s*?\n')):
    """'name , ":" , value' matcher"""
    return name & Drop(':') & value > tuple

Fields = lambda names: reduce(And, map(Field, names))

header_start   = SkipTo(Regexp(r'^Chat Transcript$') & Newline()[2])
header_fields  = Fields("Visitor Operator Company Started Finished".split())
server_message = Regexp(r'^\* (.*?)\n') > 'Server'
footer_fields  = Fields(("Your Name, Your Question, IP Address, "
                         "Host Name, Referrer, Browser/OS").split(', '))

with open('chat.log') as f:
    # parse header to find Visitor and Operator's names
    headers, = (~header_start & header_fields > dict).parse_file(f)
    # only Visitor, Operator and Server may take part in the conversation
    message = reduce(Or, [Field(headers[name])
                          for name in "Visitor Operator".split()])
    conversation = (message | server_message)[1:]
    messages, footers = ((conversation > list)
                         & Drop('\nVisitor Details\n---------------\n')
                         & (footer_fields > dict)).parse_file(f)

pprint((headers, messages, footers))

Output:

({'Company': 'Initech',
  'Finished': '16 Oct 2008 9:45:44',
  'Operator': 'Milton',
  'Started': '16 Oct 2008 9:13:58',
  'Visitor': 'Random Website Visitor'},
 [('Random Website Visitor',
   'Where do i get the cover sheet for the TPS report?'),
  ('Server',
   'There are no operators available at the moment. If you would like to leave a message, please type it in the input field below and click "Send" button'),
  ('Server',
   'Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.'),
  ('Milton', 'Y-- Excuse me. You-- I believe you have my stapler?'),
  ('Random Website Visitor', 'I really just need the cover sheet, okay?'),
  ('Milton',
   "it's not okay because if they take my stapler then I'll, I'll, I'll set the building on fire..."),
  ('Random Website Visitor', 'oh i found it, thanks anyway.'),
  ('Server',
   'Random Website Visitor is now off-line and may not reply. Currently in room: Milton.'),
  ('Milton', "Well, Ok. But… that's the last straw."),
  ('Server',
   'Milton has left the conversation. Currently in room:  room is empty.')],
 {'Browser/OS': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)',
  'Host Name': '255.255.255.255',
  'IP Address': '255.255.255.255',
  'Referrer': 'Unknown',
  'Your Name': 'Random Website Visitor',
  'Your Question': 'Where do i get the cover sheet for the TPS report?'})

Answer 4 (score: 5)

Build a parser? I can't tell whether your data is regular enough for that, but it might be worth looking into.

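Answer 5 (score: 4)

Using multi-line, commented regexes can mitigate the maintenance problem somewhat. Try to avoid the one-line super regex!

Also, consider breaking the regex down into individual tasks, one for each "thing" you want to get. For example: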

visitor  = text[/Visitor:(.*)/, 1]
operator = text[/Operator:(.*)/, 1]
body     = text[/whatever.../, 1]   # placeholder pattern

rather than

text.match(/Visitor:(.*)\nOperator:(.*)...whatever to giant regex/m) do
  visitor = $1
  operator = $2
  # etc.
end

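That makes it easy to change how any particular item is parsed. To parse a file containing many "chat blocks", use one simple regex that matches a single chat block, iterate over the text, and pass each match to your group of other matchers.

This will obviously cost some performance, but unless you are processing enormous files I wouldn't worry about it.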
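The same idea can be sketched in Python, which the question is also tagged with. This is only an illustration under stated assumptions: the names `VISITOR`, `OPERATOR`, `CHAT_BLOCK` and `extract` are hypothetical, and the two-block sample text is made up for the example. Each pattern is written across several lines with comments via `re.VERBOSE`, and one simple block-level pattern splits the text before the small matchers run.

```python
import re

# One small, commented pattern per "thing" to extract. re.VERBOSE allows
# whitespace and comments inside the pattern.
VISITOR = re.compile(r"""
    ^Visitor:     # header label at the start of a line
    [ \t]*        # optional padding after the colon
    (.*?)         # the visitor's name
    [ \t]*$       # ignore trailing whitespace
""", re.VERBOSE | re.MULTILINE)

OPERATOR = re.compile(r"""
    ^Operator:    # header label at the start of a line
    [ \t]*        # optional padding after the colon
    (.*?)         # the operator's name
    [ \t]*$       # ignore trailing whitespace
""", re.VERBOSE | re.MULTILINE)

# A single chat block: from one "Chat Transcript" heading up to the next
# heading or the end of the text.
CHAT_BLOCK = re.compile(
    r"^Chat Transcript$.*?(?=^Chat Transcript$|\Z)",
    re.MULTILINE | re.DOTALL,
)


def extract(block):
    """Run each small matcher against one chat block."""
    def first(pattern):
        match = pattern.search(block)
        return match.group(1) if match else None

    return {"visitor": first(VISITOR), "operator": first(OPERATOR)}


# Made-up input containing two chat blocks.
SAMPLE = """\
Chat Transcript

Visitor: Random Website Visitor
Operator: Milton

Random Website Visitor: I really just need the cover sheet, okay?

Chat Transcript

Visitor: Second Visitor
Operator: Milton

Second Visitor: hello
"""

results = [extract(m.group(0)) for m in CHAT_BLOCK.finditer(SAMPLE)]
print(results)
```

Changing how one item is parsed then means touching one pattern only, at the cost of scanning each block once per matcher.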

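Answer 6 (score: 2)

Consider using Ragel: http://www.complang.org/ragel/

It is what powers Mongrel under the hood. Parsing a string multiple times is going to slow things down dramatically.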

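Answer 7 (score: 2)

I have used Paul McGuire's pyParsing class library and I continue to be impressed by it: it is well documented, easy to get started with, and the rules are easy to tweak and maintain. By the way, the rules are expressed in your Python code. The log file certainly appears to have enough regularity to treat each line as a stand-alone unit.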
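The last observation, that each line can be treated as a stand-alone unit, can be illustrated without any library. The sketch below is deliberately not pyParsing; it uses only built-in string methods, and the `classify` function and its category names are assumptions made for the example.

```python
def classify(line):
    """Return a (kind, payload) pair for a single transcript line."""
    line = line.strip()
    if not line:
        return ("blank", None)
    if line.startswith("* "):
        # Server notices are prefixed with "* " in the sample transcript.
        return ("server", line[2:])
    name, sep, value = line.partition(":")
    if sep:
        # Split on the first colon only, so timestamps stay intact.
        return ("field", (name.strip(), value.strip()))
    return ("heading", line)


for sample_line in [
    "Chat Transcript",
    "Started: 16 Oct 2008 9:13:58",
    "* Milton has left the conversation. Currently in room:  room is empty.",
    "Milton: Y-- Excuse me. You-- I believe you have my stapler?",
]:
    print(classify(sample_line))
```

A stateless classifier like this cannot tell a header field from a chat message, since both look like `name: value`; that distinction still needs positional context or the names taken from the header.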

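Answer 8 (score: 0)

Just a quick post. I have only glanced at your transcript example, but I recently had to look into text parsing too and hoped to avoid going down the route of hand-rolled parsing. I happened across Ragel, which I have only just started to get my head around, but it looks very useful.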