Question

How can I use a regular expression, or a toolkit such as BeautifulSoup or lxml, to parse a sentence like this:

input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

into this:

Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>

I can't use re.findall("<person>(.*?)</person>", input) because the tags differ.

Answer 1

See how easy it is with BeautifulSoup:

from bs4 import BeautifulSoup

data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

# Iterating over the soup yields its top-level nodes: text strings and tags.
soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    print(item)

This prints:

Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>

UPD (split the non-tag items on whitespace and print each part on its own line):

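from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    if not isinstance(item, Tag):
        # Plain text node: split on whitespace, one word per line.
        for part in item.split():
            print(part)
    else:
        # Tag node: print it whole, markup included.
        print(item)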

This prints:

Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>

Hope that helps.

Answer 2

Try this regular expression:

>>> import re
>>> input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
>>> print(re.sub(r"<[^>]*?[^/]\s*>[^<]*?</.*?>", r"\n\g<0>\n", input))
Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>

>>>

A demo of the regex is here.
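Note that the substitution leaves "drove to" on a single line, while the question asks for one word per line. A minimal sketch of a tokenizing variant follows, assuming tags are never nested and carry no attributes; the names TOKEN and tokenize are made up for illustration. The pattern matches either a whole `<tag>...</tag>` element or a run of characters containing no whitespace and no `<`.

```python
import re

# Either a complete element whose closing tag matches its opening tag,
# or a single word of plain text (no whitespace, no "<").
# Assumption: tags are not nested and have no attributes.
TOKEN = re.compile(r"<(\w+)>.*?</\1>|[^<\s]+")

def tokenize(text):
    # finditer + group(0) returns the full match, not just the captured tag name.
    return [m.group(0) for m in TOKEN.finditer(text)]

data = "Yesterday<person>Peter Smith</person>drove to<location>New York</location>"
for token in tokenize(data):
    print(token)
```

Using finditer rather than findall matters here: with a capturing group in the pattern, findall would return only the tag names instead of the full tokens.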

How to parse a sentence into tokens using regex or a toolkit
