How can I parse a sentence like this, using regular expressions or a toolkit such as beautifulsoup or lxml:
input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
into this:
Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>
I can't use re.findall("<person>(.*?)</person>", input) because the tag names differ.
Answer 0 (score: 3)
Take a look at BeautifulSoup:
from bs4 import BeautifulSoup

data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
soup = BeautifulSoup(data, 'html.parser')
# Iterating over the soup yields its top-level nodes (text and tags) in order
for item in soup:
    print(item)
Prints:
Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>
UPD (splitting the non-tag items on whitespace and printing each part on a new line):
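from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    if not isinstance(item, Tag):
        # Plain text node: split on whitespace and print one word per line
        for part in item.split():
            print(part)
    else:
        # Tag node: print the whole element unchanged
        print(item)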
Prints:
Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>
Hope that helps.
Answer 1 (score: 0)
Try this regex:
>>> import re
>>> input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
>>> print(re.sub(r"<[^>]*?[^/]\s*>[^<]*?</.*?>", r"\n\g<0>\n", input))
Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>
>>>
Demo of the regex here
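Note that the substitution above leaves "drove to" on one line, while the question asks for one word per line. A regex-only variant that also splits the plain text might look like the sketch below. It is not part of either answer: the names TOKEN and tokenize are made up for illustration, and it assumes tags have no attributes, are not nested inside a tag of the same name, and do not span multiple lines.

```python
import re

# Either a whole <tag>...</tag> element (the \1 backreference forces the
# closing name to equal the opening name), or a run of characters that
# contains neither whitespace nor "<".
TOKEN = re.compile(r"<(\w+)>.*?</\1>|[^\s<]+")


def tokenize(text):
    """Return tagged elements intact and plain text split into words."""
    # finditer + group(0) is used instead of findall, because findall would
    # return only the captured tag name whenever the pattern has a group.
    return [match.group(0) for match in TOKEN.finditer(text)]


sentence = "Yesterday<person>Peter Smith</person>drove to<location>New York</location>"
for token in tokenize(sentence):
    print(token)
```

Because the backreference matches any tag name, this avoids the problem of hard-coding <person> in re.findall. For real HTML or XML with attributes or nesting, a parser such as BeautifulSoup remains the safer choice.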