Question

I have a file containing many XML-like elements, for example:

<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>

I need to extract the docid and the text. What would be a suitable regular expression?

I tried this, but it doesn't work:

collectionText = open('documents.txt').read()
docsPattern = r'<document docid=(\d+)>(.)*</document>'
docTuples = re.findall(docsPattern, collectionText)

EDIT: I changed the pattern to this:

<document docid=(\d+)>(.*)</document>

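Unfortunately, this matches the entire file rather than a single document element.

EDIT 2: A correct implementation, based on Ahmad's and Acorn's answers: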

import re

# Read the whole file so the pattern can span multiple lines.
with open('documents.txt') as f:
    collectionText = f.read()
docsPattern = r'<document docid=(\d+)>(.*?)</document>'
docTuples = re.findall(docsPattern, collectionText, re.DOTALL)

Answer 1

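You need to use the DOTALL option with your regular expression so that it matches across multiple lines (by default, . does not match newline characters).

Also note the comment about greediness in Ahmad's answer.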

import re

text = '''<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>'''

pattern = r'<document docid=(\d+)>(.*?)</document>'
print(re.findall(pattern, text, re.DOTALL))
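To see why the two earlier attempts in the question fail, here is a small sketch. The shortened sample text and the variable names are made up for illustration.

```python
import re

# Shortened stand-in for one <document> element (illustrative only).
text = "<document docid=1>\nline one\nline two\n</document>"

# Without re.DOTALL, '.' stops at the first newline, so nothing matches.
no_dotall = re.findall(r'<document docid=(\d+)>(.*?)</document>', text)

# With '(.)*' the group is repeated, so only its last iteration is kept:
# a single character instead of the whole body.
repeated_group = re.findall(r'<document docid=(\d+)>(.)*</document>',
                            text, re.DOTALL)

print(no_dotall)
print(repeated_group)
```

Moving the quantifier inside the group, as in (.*?), is what makes the second group capture the full text.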

In general, regular expressions are not well suited to parsing XML/HTML.

See:

RegEx match open tags except XHTML self-contained tags and http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

You would be better off using a parser such as lxml.
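As a rough sketch of the parser-based approach: the sample is not well-formed XML (the docid value is unquoted and the & is unescaped), so a strict XML parser would reject it. The version below swaps in Python's standard-library html.parser instead of lxml, purely to stay dependency-free. The class name DocumentCollector and the second sample record are made up for illustration.

```python
from html.parser import HTMLParser


class DocumentCollector(HTMLParser):
    """Collects (docid, text) pairs from <document docid=N> elements."""

    def __init__(self):
        super().__init__()
        self.docs = []
        self._docid = None   # docid of the element being read, if any
        self._chunks = []    # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "document":
            self._docid = dict(attrs).get("docid")
            self._chunks = []

    def handle_data(self, data):
        if self._docid is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "document" and self._docid is not None:
            self.docs.append((self._docid, "".join(self._chunks).strip()))
            self._docid = None


# The second record is invented sample data.
sample = """<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>
<document docid=2>
Some Other Title
</document>"""

collector = DocumentCollector()
collector.feed(sample)
collector.close()
docs = collector.docs
for docid, body in docs:
    print(docid, repr(body))
```

The lenient HTML parser tolerates the unquoted attribute and the bare ampersand, which is why it is used here. With lxml, its HTML parser would be the closer equivalent.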

Answer 2

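Your pattern is greedy, so if you have multiple <document> elements it will end up matching all of them as a single match.

You can make it non-greedy by using .*?, which means "match zero or more characters, as few as possible". The updated pattern is: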

<document docid=(\d+)>(.*?)</document>
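A quick way to see the difference is to run both patterns over a text with two elements. This is only a sketch, and the two-element sample is made up. re.DOTALL is included so that . can cross newlines.

```python
import re

# Two made-up elements, just to compare the two quantifiers.
text = """<document docid=1>
first
</document>
<document docid=2>
second
</document>"""

# Greedy: .* runs to the LAST </document>, swallowing the second element.
greedy = re.findall(r'<document docid=(\d+)>(.*)</document>', text, re.DOTALL)

# Non-greedy: .*? stops at the FIRST </document> it can reach.
lazy = re.findall(r'<document docid=(\d+)>(.*?)</document>', text, re.DOTALL)

print(len(greedy), len(lazy))
```

The greedy version returns one tuple whose text still contains the second opening tag, while the non-greedy version returns one tuple per element.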

Answer 3

Just as an FYI ...

A working .NET pattern for "xml-like" structures:

<([^<>]+)>([^<>]+)<(\/[^<>]+)>
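For reference, here is how that pattern behaves on the sample from the question. The answer gives it for .NET; this sketch assumes it carries over unchanged to Python's re module, since it only uses character classes and groups.

```python
import re

text = '''<document docid=1>
Preliminary Report-International Algebraic Language
Perlis, A. J. & Samelson,K.
CACM December, 1958
</document>'''

# Group 1: everything inside the opening tag (name plus attributes).
# Group 2: the text between the tags. [^<>] also matches newlines,
#          so no DOTALL flag is needed.
# Group 3: the closing tag, including the leading slash.
pattern = r'<([^<>]+)>([^<>]+)<(\/[^<>]+)>'
matches = re.findall(pattern, text)

for opening, body, closing in matches:
    print(opening, closing)
```

Note that the docid is not captured separately here. It comes back as part of the opening-tag group and would still need to be split out.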
