Problem parsing HTML elements with BeautifulSoup

Time: 2013-11-22 10:00:39

Tags: python parsing beautifulsoup

url = 'http://www.zillow.com/homedetails/3728-Balcary-Bay-Champaign-IL-61822/89057727_zpid/'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

info = soup.findAll('span',{'itemtype':'http://schema.org/GeoCoordinates'}) #this tag + class combination found 4 matches, 4th one was the required one, just selecting that here
for form in info:
    b = form.find('meta')['content']
print b

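This is a snippet of the code I am using to get latitude and longitude information from Zillow. Using the span tag and its itemtype attribute, I can pinpoint the markup that stores the latitude and longitude. The page I am parsing contains markup similar to the following: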

<span itemprop="geo" itemscope="" itemtype="http://schema.org/GeoCoordinates">
<meta content="40.12938" itemprop="latitude">
<meta content="-88.30766" itemprop="longitude">
</span>

I can get the latitude but not the longitude. Can someone help me get this information?

Code output:

>>> ================================ RESTART ================================
>>> 
40.12938
>>> 

Expected output:

>>> ================================ RESTART ================================
>>> 
40.12938 -88.30766
>>> 

1 answer:

Answer 0: (score: 1)

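form.find() returns only the first match, <meta content="40.12938" itemprop="latitude">, which is why the longitude never appears. form.find_all() returns every match, and you can collect the values into a list with a list comprehension, as shown below: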

import urllib2  # Python 2 standard library
from bs4 import BeautifulSoup  # find_all() requires BeautifulSoup 4

url = 'http://www.zillow.com/homedetails/3728-Balcary-Bay-Champaign-IL-61822/89057727_zpid/'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

info = soup.findAll('span',{'itemtype':'http://schema.org/GeoCoordinates'}) #this tag + class combination found 4 matches, 4th one was the required one, just selecting that here
# find_all() returns every <meta> tag inside the span, not just the first one
coordinates = [i['content'] for i in info[0].find_all('meta')]

print coordinates

It produces:

[u'40.12938', u'-88.30766']
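
The difference between find() and find_all() can also be checked without fetching the page. The following is a minimal sketch, assuming Python 3 with BeautifulSoup 4 installed and the standard-library "html.parser" backend; it parses the markup quoted in the question as a string rather than the live Zillow page, and the variable names (html, geo, first_only, latitude, longitude) are illustrative only. It also shows one possible way to select each value by its itemprop attribute, so the result does not depend on the order of the meta tags.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; no network request is made here.
html = """
<span itemprop="geo" itemscope="" itemtype="http://schema.org/GeoCoordinates">
<meta content="40.12938" itemprop="latitude">
<meta content="-88.30766" itemprop="longitude">
</span>
"""

soup = BeautifulSoup(html, "html.parser")
geo = soup.find("span", {"itemtype": "http://schema.org/GeoCoordinates"})

# find() stops at the first <meta>; find_all() returns every match.
first_only = geo.find("meta")["content"]
coordinates = [meta["content"] for meta in geo.find_all("meta")]

# Selecting by itemprop avoids relying on the order of the tags.
latitude = geo.find("meta", {"itemprop": "latitude"})["content"]
longitude = geo.find("meta", {"itemprop": "longitude"})["content"]

print(first_only)
print(coordinates)
print(latitude, longitude)
```

Here first_only holds just the latitude, which matches the behaviour described in the question, while coordinates and the latitude/longitude pair contain both values.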