Question

I'm learning Python and Beautiful Soup by trying to scrape some data. I have an HTML page in this format...

span id listing-name-1
span class address
span preferredcontact="1"
a ID websiteLink1

span id listing-name-2
span class address
span preferredcontact="2"
a ID websiteLink2

span id listing-name-3
span class address
span preferredcontact="3"
a ID websiteLink3

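and so on, up to 40 such entries.

I want the text inside these classes/IDs, in the same order as it appears on the HTML page.

To start, I tried something like this to get listing-name-1: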

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.yellowpages.com.au/search/listings?clue=architects&locationClue=New+South+Wales&x=45&y=12")

soup = BeautifulSoup(page)

soup.find(span,attrs={"id=listing-name-1"})

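This throws an "An existing connection was forcibly closed by the remote host" error.

I don't know how to fix this. I need help with two things:

How to fix that error
How to iterate over listing-name-1 through listing-name-40? I don't want to type soup.find(span,attrs={"id=listing-name-1"}) for all 40 span IDs.

Thanks!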

Answer 1

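With lxml.html you can call parse directly with the URL, so you don't have to call urllib yourself. Also, rather than find or findall, you'll want to call xpath so that you get the full expressiveness of XPath; if you try the same expression below with find, it returns an invalid predicate error.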

#!/usr/bin/env python

import lxml.html

url = "http://www.yellowpages.com.au/search/listings?clue=architects&locationClue=New+South+Wales&x=45&y=12"
tree = lxml.html.parse(url)
listings = tree.xpath("//span[contains(@id,'listing-name-')]/text()")
print listings

This prints the following, preserving order:

['Cape Cod Australia Pty Ltd',
'BHI',
'Fibrent Pty Ltd Building & Engineering Assessments',
 ...
'Archicentre']
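The "invalid predicate" behaviour can be reproduced without any network access. The sketch below is in Python 3 (the thread itself uses Python 2) and uses the standard library's xml.etree.ElementTree, whose find/findall accept a limited XPath subset; as far as I know lxml's find follows a very similar subset, but that is an assumption here. The sample markup is hypothetical, modeled on the structure described in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modeled on the question's structure.
SAMPLE = """
<div>
  <span id="listing-name-1">Cape Cod Australia Pty Ltd</span>
  <span class="address">4th Floor 410 Church St, North Parramatta NSW 2151</span>
  <span id="listing-name-2">BHI</span>
  <span class="address">Suite 5, 65 Doody St, Alexandria NSW 2015</span>
</div>
"""

root = ET.fromstring(SAMPLE)

# The limited path language has no functions such as contains(),
# so findall() rejects this predicate with a SyntaxError.
try:
    root.findall(".//span[contains(@id,'listing-name-')]")
    predicate_rejected = False
except SyntaxError:
    predicate_rejected = True

# Staying inside the subset: select every span that has an id,
# then filter on the id prefix in plain Python.
names = [
    span.text
    for span in root.findall(".//span[@id]")
    if span.get("id").startswith("listing-name-")
]
print(predicate_rejected, names)
```

With lxml, the xpath method avoids this workaround because it evaluates full XPath 1.0 expressions, including contains().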

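To answer the question in the comments on my answer: what you want to search for is <div class="listingInfoContainer">...</div>, which contains all the information you want (name, address, etc.). You can then loop over the list of div elements matching that criterion and use XPath expressions to extract the rest. Note that in this case I use container.xpath('.//span'), which searches from the current node (the container div). If you leave out the . and write just //span, the search starts from the top of the tree and you get a list of all matching elements, which is not what you want once you have selected a container node.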

#!/usr/bin/env python

import lxml.html

url = "http://www.yellowpages.com.au/search/listings?clue=architects&locationClue=New+South+Wales&x=45&y=12"
tree = lxml.html.parse(url)
container = tree.xpath("//div[@class='listingInfoContainer']")
listings = []
for c in container:
    data = {}
    data['name'] = c.xpath('.//span[contains(@id,"listing")]/text()')
    data['address'] = c.xpath('.//span[@class="address"]/text()')
    listings.append(data)

print listings

Output:

[{'name': ['Cape Cod Australia Pty Ltd'], 
  'address': ['4th Floor 410 Church St, North Parramatta NSW 2151']}, 
 {'name': ['BHI'], 
  'address': ['Suite 5, 65 Doody St, Alexandria NSW 2015']}, 
 {'name': ['Fibrent Pty Ltd Building & Engineering Assessments'], 
  'address': ["Suite 3B, Level 1, 72 O'Riordan St, Alexandria NSW 2015"]}, 
  ...
 {'name': ['Archicentre'], 
  'address': ['\n                                         Level 3, 60 Collins St\n                                         ',
              '\n                                         Melbourne VIC 3000\n                                    ']}]

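This is a list (again preserving order, as you wanted) of dictionaries with the keys name and address, each of which holds a list. That inner list is what text() returns: it keeps the \n newlines from the original HTML, and elements such as <br> split the text into separate list items. An example of why it does this is the Archicentre listing, whose original HTML is: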

<span class="address">
     Level 3, 60 Collins St
     <br/>
     Melbourne VIC 3000
</span>
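If a single clean string per field is preferable to a list of raw text nodes, the pieces can be stripped and joined afterwards. This is a minimal Python 3 sketch; the helper name clean_text_nodes and the ", " separator are illustrative choices, not part of lxml.

```python
def clean_text_nodes(nodes, sep=", "):
    """Strip whitespace from each text node, drop empty ones, join the rest."""
    parts = [node.strip() for node in nodes]
    return sep.join(part for part in parts if part)


# Text nodes shaped like the Archicentre address above
# (the exact amount of indentation is shortened here).
address = [
    "\n     Level 3, 60 Collins St\n     ",
    "\n     Melbourne VIC 3000\n",
]
cleaned = clean_text_nodes(address)
print(cleaned)
```

The same helper works on single-element lists such as the name field, where it simply returns the one stripped string.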

Answer 2

The answer to the second part is simple:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.yellowpages.com.au/search/listings?clue=architects&locationClue=New+South+Wales&x=45&y=12")

soup = BeautifulSoup(page)

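# Collect the spans in page order; find() returns None when an id is missing.
names = []
for num in range(1, 41):
    names.append(soup.find("span", attrs={"id": "listing-name-" + str(num)}))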

Answer 3

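Your first problem doesn't seem to be Python-related. Try printing page.read() and see whether it gives any output. Try opening the page in your web browser to see whether it loads.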
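One way to run that check so the failure is captured rather than crashing the script is sketched below, in Python 3 with urllib.request instead of the thread's urllib2. The explicit User-Agent header is purely an assumption: some servers drop connections from clients that send the default urllib identifier, but nothing in this thread confirms that is what happens here. The names build_request and fetch are hypothetical helpers.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-scraper)"):
    """Return a Request carrying an explicit User-Agent (an untested guess at a fix)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return (body, error); exactly one of the two is None."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read(), None
    except OSError as exc:
        # URLError and ConnectionResetError both derive from OSError,
        # so a forcibly closed connection ends up here.
        return None, exc
```

Calling body, err = fetch(url) and printing err shows whether the server refused the connection, independently of any parsing code.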

As for your second question, you can pass a regular expression to findAll:

import re
import urllib2

from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.yellowpages.com.au/search/listings?clue=architects&locationClue=New+South+Wales&x=45&y=12")

soup = BeautifulSoup(page)

listing_names = re.compile('listing-name-[0-9]+')
listings = soup.findAll('span', id=listing_names)
print(listings)

The above prints all the listings on my machine, so your first problem is definitely not in your code.
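A note on what that pattern matches: Beautiful Soup 4 documents that a compiled pattern is applied with search(), and I believe the older BeautifulSoup 3 module used here behaves the same way, so an unanchored pattern also matches ids that merely contain the text. The Python 3 sketch below uses only the re module; the last two ids are hypothetical, made up to show the difference.

```python
import re

loose = re.compile("listing-name-[0-9]+")       # pattern from the answer
strict = re.compile(r"^listing-name-[0-9]+$")   # anchored variant

ids = [
    "listing-name-1",
    "listing-name-40",
    "listing-name-",            # hypothetical: no digits, matches neither
    "old-listing-name-7-link",  # hypothetical: matches only the loose pattern
]

loose_hits = [i for i in ids if loose.search(i)]
strict_hits = [i for i in ids if strict.search(i)]
print(loose_hits)
print(strict_hits)
```

On a page where only the listing spans carry such ids, both patterns select the same elements; the anchored form just guards against accidental matches.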

Python - Extracting links using IDs

3 answers: