BeautifulSoup adds content when I run find_all

Time: 2015-05-30 16:45:46

Tags: python beautifulsoup

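I am trying to scrape listings from yp.com. While building the code, I was able to isolate the section by name (div class="search-results organic"), but when I run find_all() on that content, it returns listings from outside that section.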

The URL is http://www.yellowpages.com/search?search_terms=septic&geo_location_terms=80521

Here is what I am running:

# Only the imports the script actually uses
from bs4 import BeautifulSoup
import requests

filename = "webspyorganictag.html"
term = "septic"
zipcode = "80521"
url = "http://www.yellowpages.com/search?search_terms="+ term +"&geo_location_terms="+ zipcode

with open(filename, "w") as myfile:
    myfile.write("Information from the organic<br>")

r = requests.get(url)
soup = BeautifulSoup(r.content, "xml")
organic = soup.find("div", {"class": "search-results organic"})

with open(filename, "a") as myfile:
    myfile.write(str(organic))

This returns only the content in the organic listings section. There are 30 listings.

Then I added:

listings = organic.find_all("div", {"class": "info"})
with open(filename, "a") as myfile:
    # Number the listings starting from 1
    for i, listing in enumerate(listings, start=1):
        myfile.write("This is listing " + str(i) + "<br>")
        myfile.write(str(listing) + "<br>")

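This returns the original 30 listings plus another 10 listings (from id="main-aside") that are not contained in the variable 'organic'.

Shouldn't calling organic.find_all() limit the scope to the data in the variable 'organic'?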

1 Answer:

Answer 0: (score: 1)

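With "xml", the extra class="info"> elements are already present in the result of soup.find("div", {"class": "search-results organic"}), so it is not surprising that find_all returns 41. You can easily see the additional elements by looking at what organic returns, i.e. href="/wray-co/mip/ritcheys-redi-mix-precast-inc-10367117?lid=1000575822573" and href="/longmont-co/mip/rays-backhoe-service-6327932?lid=216924340", along with all the other listings from the ten featured ones.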

If you look at line 41 of the html you wrote, it also contains:

href="/wray-co/mip/ritcheys-redi-mix-precast-inc-10367117?lid=1000575822573", which is the last of the featured listings.
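
One way to confirm that the parsed tree, rather than find_all itself, is responsible is to check whether the element with id="main-aside" ended up nested inside organic. The following is only a minimal sketch under stated assumptions: the helper name aside_inside and the inline markup are hypothetical, and the sample deliberately nests the aside inside the organic div to imitate a mis-parsed tree.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a mis-nested tree: the aside sits inside
# the organic container, as it might if the parser never closed the div.
SAMPLE = """
<div class="search-results organic">
  <div class="info">organic 1</div>
  <div class="info">organic 2</div>
  <div id="main-aside">
    <div class="info">featured 1</div>
  </div>
</div>
"""


def aside_inside(container, aside_id="main-aside"):
    """Return True if an element with the given id is a descendant of container."""
    return container.find(id=aside_id) is not None


soup = BeautifulSoup(SAMPLE, "html.parser")
organic = soup.find("div", {"class": "search-results organic"})

listings = organic.find_all("div", {"class": "info"})
# Listings that have the aside among their ancestors are the unexpected extras.
extras = [tag for tag in listings if tag.find_parent(id="main-aside") is not None]

print(aside_inside(organic), len(listings), len(extras))
```

If aside_inside returns True on the real page, find_all is behaving as documented: it searches the descendants of organic, and the parser has placed the featured listings among them.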

The problem is the parser. If you change the parser to "lxml":

soup = BeautifulSoup(r.content, "lxml")

organic = soup.find("div", {"class": "search-results organic"})

# Same search as in the question: div elements with class "info"
print(len(organic.find_all("div", {"class": "info"})))
30

Or use html.parser:

soup = BeautifulSoup(r.content, "html.parser")

organic = soup.find("div", {"class": "search-results organic"})

print(len(organic.find_all("div", {"class": "info"})))
30

You get the correct result.
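
To compare parsers side by side, a small helper can run the same scoped search under each one. This is a sketch and not the answerer's code: count_info and the sample HTML are hypothetical, and any parser that is not installed is simply reported as such.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, well-formed sample: two organic listings and one featured
# listing that lives outside the organic container.
SAMPLE = """
<html><body>
<div class="search-results organic">
  <div class="info">organic 1</div>
  <div class="info">organic 2</div>
</div>
<div id="main-aside">
  <div class="info">featured 1</div>
</div>
</body></html>
"""


def count_info(html, parser):
    """Count div.info elements inside the organic container (None if not found)."""
    soup = BeautifulSoup(html, parser)
    organic = soup.find("div", {"class": "search-results organic"})
    if organic is None:
        return None
    return len(organic.find_all("div", {"class": "info"}))


results = {}
for parser in ("html.parser", "lxml", "xml"):
    try:
        results[parser] = count_info(SAMPLE, parser)
    except FeatureNotFound:
        # lxml is an optional dependency; record parsers that are unavailable.
        results[parser] = "not installed"

print(results)
```

Running the same helper on r.content from the real page should make any disagreement between parsers visible, since real-world HTML is often not valid XML.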