Question

Given this:

input = """Yesterday<person>Peter</person>drove to<location>New York</location>"""

How can I use a regex pattern to extract:

person: Peter
location: New York

This works well, but I don't want to hard-code the tags, since they can change:

import re

print(re.findall("<person>(.*?)</person>", input))
print(re.findall("<location>(.*?)</location>", input))

Answer 1

Use a tool designed for the job. I happen to like lxml, but there are others.

>>> minput = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
>>> from lxml import html
>>> tree = html.fromstring(minput)
>>> for e in tree.iter():
...     print(e, e.tag, e.text_content())
...     if e.tag == 'person':          # getting the last name per comment
...         last = e.text_content().split()[-1]
...         print(last)
...


<Element p at 0x3118ca8> p YesterdayPeter Smithdrove toNew York
<Element person at 0x3118b48> person Peter Smith
Smith                                            # here is the last name
<Element location at 0x3118ba0> location New York

If you are new to Python, you may want to visit this site for installers for a number of packages, including lxml.
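
For the "don't hard-code the tags" part of the question, the same iterate-over-elements idea can be sketched with only the standard library. This is a rough sketch rather than the lxml approach above. It assumes the input is well-formed once wrapped in a root element (the `<root>` wrapper and the names `text` and `found` are made up for illustration), and that each tag appears only once, since a repeated tag would overwrite the earlier value.

```python
import xml.etree.ElementTree as ET

text = "Yesterday<person>Peter Smith</person>drove to<location>New York</location>"

# Wrap the fragment in a hypothetical <root> element so it parses as well-formed XML.
root = ET.fromstring("<root>%s</root>" % text)

# Map each child tag name to its text without naming any tag in advance.
found = {child.tag: child.text for child in root}

for tag, value in found.items():
    print("%s: %s" % (tag, value))
```

If the input is real-world HTML rather than tidy XML, a forgiving parser such as lxml.html is the safer choice.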

Answer 2

Avoid parsing HTML with regular expressions; use an HTML parser instead.

Here's an example using BeautifulSoup:

from bs4 import BeautifulSoup

data = "Yesterday<person>Peter</person>drove to<location>New York</location>"
# Name the parser explicitly so the result doesn't depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")

print('person: %s' % soup.person.text)
print('location: %s' % soup.location.text)

Prints:

person: Peter
location: New York

Note how simple the code is.
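
If the tag names are not known in advance, a parser can also report whatever tags it encounters. Below is a minimal sketch that uses the standard library's `html.parser` instead of BeautifulSoup. The `TagCollector` class and its attributes are invented for this example, and it assumes the tags are flat (not nested), as in the question's input.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect (tag, text) pairs for every element seen; assumes no nesting."""

    def __init__(self):
        super().__init__()
        self.current = None   # name of the tag we are inside, if any
        self.pairs = []       # [tag, text] entries in document order

    def handle_starttag(self, tag, attrs):
        self.current = tag
        self.pairs.append([tag, ""])

    def handle_data(self, data):
        # Ignore text that sits outside any tag ("Yesterday", "drove to").
        if self.current is not None:
            self.pairs[-1][1] += data

    def handle_endtag(self, tag):
        self.current = None


data = "Yesterday<person>Peter</person>drove to<location>New York</location>"
collector = TagCollector()
collector.feed(data)
collector.close()

for tag, text in collector.pairs:
    print("%s: %s" % (tag, text))
```

Nothing here names `person` or `location`, so new tags in the input are picked up without code changes.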

Hope that helps.

Regex pattern to extract tags and their contents

2 answers: