Parsing HTML with Python and Beautiful Soup

Time: 2011-07-03 22:44:40

Tags: python html find beautifulsoup web-scraping

<div class="profile-row clearfix"><div class="profile-row-header">Member Since</div><div class="profile-information">January 2010</div></div>
<div class="profile-row clearfix"><div class="profile-row-header">AIGA Chapter</div><div class="profile-information">Alaska</div></div>
<div class="profile-row clearfix"><div class="profile-row-header">Title</div><div class="profile-information">Owner</div></div>
<div class="profile-row clearfix"><div class="profile-row-header">Company</div><div class="profile-information">Mad Dog Graphx</div></div>

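I have used Beautiful Soup to get as far as this part of the HTML. I now want to search this markup and extract values such as January 2010, Alaska, Owner, and Mad Dog Graphx. All of these values share the same class, but each one is preceded by a different label, such as "Member Since", "AIGA Chapter", and so on. How can I search for "Member Since" and get January 2010 back, and do the same for the other three fields?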

1 answer:

Answer 0: (score: 3)

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<div class="profile-row clearfix"><div class="profile-row-header">Member Since</div><div class="profile-information">January 2010</div></div>
... <div class="profile-row clearfix"><div class="profile-row-header">AIGA Chapter</div><div class="profile-information">Alaska</div></div>
... <div class="profile-row clearfix"><div class="profile-row-header">Title</div><div class="profile-information">Owner</div></div>
... <div class="profile-row clearfix"><div class="profile-row-header">Company</div><div class="profile-information">Mad Dog Graphx</div></div>
... ''')
>>> for row in soup.findAll('div', {'class':'profile-row clearfix'}):
...  field, value = row.findAll(text = True)
...  print field, value
... 
Member Since January 2010
AIGA Chapter Alaska
Title Owner
Company Mad Dog Graphx

You can of course do whatever you like with field and value, such as building a dict from them or storing them in a database.
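As a rough sketch of the dict idea, the loop can collect each label/value pair into a dictionary so that a value can be looked up by its label. This sketch assumes the newer bs4 package (Beautiful Soup 4) on Python 3 rather than the BeautifulSoup 3 import used in the session above, and the names HTML and parse_profile are made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 (bs4) is installed

# The four rows from the question.
HTML = """<div class="profile-row clearfix"><div class="profile-row-header">Member Since</div><div class="profile-information">January 2010</div></div>
<div class="profile-row clearfix"><div class="profile-row-header">AIGA Chapter</div><div class="profile-information">Alaska</div></div>
<div class="profile-row clearfix"><div class="profile-row-header">Title</div><div class="profile-information">Owner</div></div>
<div class="profile-row clearfix"><div class="profile-row-header">Company</div><div class="profile-information">Mad Dog Graphx</div></div>
"""


def parse_profile(html):
    """Return a dict mapping each row's label text to its value text."""
    soup = BeautifulSoup(html, "html.parser")
    profile = {}
    for row in soup.find_all("div", class_="profile-row"):
        # Each row is expected to hold exactly two non-empty text nodes:
        # the label first, then the value. Anything else raises ValueError.
        field, value = row.stripped_strings
        profile[field] = value
    return profile


profile = parse_profile(HTML)
print(profile["Member Since"])  # prints: January 2010
```

With the dict in hand, the other three fields are plain lookups, for example profile["AIGA Chapter"] or profile["Company"].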

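If the "profile-row clearfix" div contains other divs or other text nodes, the two-value unpacking above will fail, and you will need to look up each part by its own class instead, with something like field = row.find('div', {'class':'profile-row-header'}).findAll(text=True).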
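A minimal sketch of that more defensive approach follows, again assuming bs4 on Python 3 rather than the BeautifulSoup 3 API shown in the answer. The sample markup in MESSY_HTML and the function name extract_fields are hypothetical, invented only to show a row that carries extra nodes.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 (bs4) is installed

# Hypothetical row with an extra element and nested markup, which would
# break a plain "field, value = ..." unpacking of the row's text nodes.
MESSY_HTML = """<div class="profile-row clearfix">
  <span class="icon">*</span>
  <div class="profile-row-header">Title</div>
  <div class="profile-information"> <b>Owner</b> </div>
</div>
<div class="profile-row clearfix"><div class="profile-row-header">Orphan label</div></div>
"""


def extract_fields(html):
    """Map label text to value text, addressing each part by its own class."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for row in soup.find_all("div", class_="profile-row"):
        header = row.find("div", class_="profile-row-header")
        info = row.find("div", class_="profile-information")
        if header is None or info is None:
            continue  # skip rows that lack either part
        # get_text(strip=True) joins all nested text and trims whitespace.
        fields[header.get_text(strip=True)] = info.get_text(strip=True)
    return fields


fields = extract_fields(MESSY_HTML)
print(fields)
```

Looking up the header and the information div separately means unrelated siblings are ignored, at the cost of depending on those two class names staying stable.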