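How to parse a sentence into tokens using a regex or a toolkit

Asked: 2014-03-26 15:05:04

Tags: python regex xml-parsing beautifulsoup lxml

How can I use a regex, or a toolkit such as beautifulsoup or lxml, to parse a sentence like this: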

input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

into this:

Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>

I can't use re.findall("<person>(.*?)</person>", input) because the tag names differ.

2 answers:

Answer 0 (score: 3)

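See how easy it is with BeautifulSoup:

from bs4 import BeautifulSoup

data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

# html.parser keeps the unknown tags <person> and <location> as elements
soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    # Each top-level child is either a text node or a tag element
    print(item)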

Prints:

Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>

UPD (split the non-tag items on whitespace and print each part on a new line):

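from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    if not isinstance(item, Tag):
        # Plain text node: split on whitespace and print each word
        for part in item.split():
            print(part)
    else:
        # Tag element: print it whole
        print(item)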

Prints:

Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>

Hope that helps.

Answer 1 (score: 0)

Try this regex:

>>> import re
>>> input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
>>> print(re.sub(r"<[^>]*?[^/]\s*>[^<]*?</.*?>", r"\n\g<0>\n", input))
Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>

>>> 

Demo of the regex: here
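
Note that the substitution above keeps "drove to" on one line, while the question asks for "drove" and "to" as separate tokens. The following is a minimal sketch of a regex-only variant that returns each tagged element whole and splits the remaining text on whitespace. The names TOKEN_RE and tokenize are hypothetical and not part of the original answer. It assumes the input is well formed and that tags of the same name are not nested inside each other; for anything more complex a real parser such as BeautifulSoup is the safer choice.

```python
import re

# Hypothetical helper names, not from the original answer.
# First alternative: an opening tag, lazily matched content, and the closing
# tag with the same name (the \1 backreference handles differing tag names).
# Second alternative: a run of characters that is neither "<" nor whitespace.
TOKEN_RE = re.compile(r"<(\w+)[^>]*>.*?</\1>|[^<\s]+", re.DOTALL)


def tokenize(sentence):
    """Return tagged elements whole and plain text split on whitespace."""
    # finditer with group(0) is used instead of findall, because findall
    # would return only the captured tag name when a group is present.
    return [match.group(0) for match in TOKEN_RE.finditer(sentence)]


data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
for token in tokenize(data):
    print(token)
```

The backreference is what removes the need to hard-code <person> or <location> in the pattern, which was the obstacle with re.findall in the question.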