Python regex: grouping elements that belong to the same category

Asked: 2013-01-30 08:24:51

Tags: python regex

I have a file like this:

<table>
<span clas="city"> Miami </span> <span><a href="miami" > Miami </a> </span>
<span clas="city"> Orlando </span> <span><a href="orlando" > orlando </a></span>
</table>
<table>
<span clas="city"> Los Angeles </span> <span><a href="Los Angeles" > </a> </span>
<span clas="city"> San Diego </span>  <span><a href="Los Angeles" > San Diego</a> </span>
</table>

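How can I extend the regex re.compile('city">([^<]+)</span>') so that cities belonging to the same state (table) are grouped together, without a while loop that checks whether the table has ended? For example: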

State 1: Miami, Orlando
State 2: Los Angeles, San Diego

1 answer:

Answer 0 (score: 3)

Use a proper HTML parser:

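from bs4 import BeautifulSoup

# Name the parser explicitly; html.parser leaves each <span> nested in its <table>
with open(...) as f:
    soup = BeautifulSoup(f, "html.parser")

states = {}
for i, table in enumerate(soup("table")):
    for span in table("span"):
        # Skip the spans that only wrap a link; keep the city-name spans
        if span.find("a") is None:
            states.setdefault(i, []).append(span.text.strip())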

This gives:

states
{0: ['Miami', 'Orlando'], 1: ['Los Angeles', 'San Diego']}
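
If a parser is not an option, the grouping asked for in the question can also be approximated with two regular expressions: one that isolates each table and the question's own pattern applied inside it. The following is only a minimal sketch. It assumes the tables are written exactly as `<table>...</table>` with no attributes and are never nested, and the names `table_re`, `city_re` and `group_cities` are made up for illustration. Regexes remain fragile on real-world HTML, so the parser approach is still the safer choice.

```python
import re

# Sample input copied from the question
html = """<table>
<span clas="city"> Miami </span> <span><a href="miami" > Miami </a> </span>
<span clas="city"> Orlando </span> <span><a href="orlando" > orlando </a></span>
</table>
<table>
<span clas="city"> Los Angeles </span> <span><a href="Los Angeles" > </a> </span>
<span clas="city"> San Diego </span>  <span><a href="Los Angeles" > San Diego</a> </span>
</table>"""

# Non-greedy match of one table body; DOTALL lets "." span line breaks
table_re = re.compile(r"<table>(.*?)</table>", re.DOTALL)
# The pattern from the question, unchanged
city_re = re.compile(r'city">([^<]+)</span>')


def group_cities(text):
    """Return one list of city names per table, in document order."""
    groups = []
    for table_body in table_re.findall(text):
        groups.append([city.strip() for city in city_re.findall(table_body)])
    return groups


for n, cities in enumerate(group_cities(html), start=1):
    print("State %d: %s" % (n, ", ".join(cities)))
```

The outer `findall` replaces the explicit "has the table ended?" loop: each table body is extracted first, so every city match is already scoped to one state.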