Question

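I am trying to scrape listings from yp.com. While building the code, I was able to isolate the section I want by name (div class="search-results organic"), but when I run find_all() on that content, it returns listings from outside that section.

The URL is http://www.yellowpages.com/search?search_terms=septic&geo_location_terms=80521

This is what I am running: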

from bs4 import BeautifulSoup
import urllib
import re
import xml
import requests
from urlparse import urlparse

filename = "webspyorganictag.html"
term = "septic"
zipcode = "80521"
url = "http://www.yellowpages.com/search?search_terms="+ term +"&geo_location_terms="+ zipcode

with open(filename, "w") as myfile:
    myfile.write("Information from the organic<br>")

r = requests.get(url)
soup = BeautifulSoup(r.content, "xml")
organic = soup.find("div", {"class": "search-results organic"})

with open(filename, "a") as myfile:
    myfile.write(str(organic))

This returns only the content of the organic listings section. There are 30 listings.

Then I added:

listings = organic.find_all("div", {"class": "info"})
i = 1
with open(filename, "a") as myfile:
    for listing in listings:
        myfile.write("This is listing " + str(i) + "<br>")
        myfile.write(str(listing) + "<br>")
        i += 1

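This returns the original 30 listings plus another 10 listings (from id="main-aside") that are not contained in the variable 'organic'.

Shouldn't calling organic.find_all() limit the scope to the data in the variable 'organic'?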

Answer 1

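With the "xml" parser, the element returned by soup.find("div", {"class": "search-results organic"}) already contains the extra class="info"> entries, so it is not surprising that find_all returns 41. You can easily see the additional elements by looking at the value of organic itself: it includes href="/wray-co/mip/ritcheys-redi-mix-precast-inc-10367117?lid=1000575822573", href="/longmont-co/mip/rays-backhoe-service-6327932?lid=216924340" and all the other listings from the ten featured ones.

If you look at line 41 of the HTML file you wrote, it also contains:

href="/wray-co/mip/ritcheys-redi-mix-precast-inc-10367117?lid=1000575822573", which is the last of the featured listings.

The problem is the parser. The page is HTML, not XML, so the "xml" parser builds a different tree in which the featured listings end up inside the organic div. If you change the parser to "lxml":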

soup = BeautifulSoup(r.content,"lxml")

organic = soup.find("div", {"class": "search-results organic"})

print(len(organic.find_all("div",{"class":"info"})))
30

Or use html.parser:

soup = BeautifulSoup(r.content,"html.parser") 

organic = soup.find("div", {"class": "search-results organic"})

print(len(organic.find_all("div",{"class":"info"})))
30

In both cases you get the correct result.
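
As a rough way to confirm that find_all() stays inside the section once an HTML parser is used, here is a minimal sketch. SAMPLE_HTML is a made-up miniature of the page (two organic listings plus one featured listing in a sidebar), not the real yellowpages.com markup, and scoped_listings is a hypothetical helper name introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: two organic listings and one featured
# listing in a sidebar. The real markup is much larger.
SAMPLE_HTML = """
<html><body>
<div class="search-results organic">
  <div class="info"><a href="/organic-1">Organic 1</a></div>
  <div class="info"><a href="/organic-2">Organic 2</a></div>
</div>
<div id="main-aside">
  <div class="info"><a href="/featured-1">Featured 1</a></div>
</div>
</body></html>
"""


def scoped_listings(html, parser="html.parser"):
    """Return the div.info elements found inside the organic section."""
    soup = BeautifulSoup(html, parser)
    organic = soup.find("div", {"class": "search-results organic"})
    if organic is None:
        return []
    return organic.find_all("div", {"class": "info"})


listings = scoped_listings(SAMPLE_HTML)
hrefs = [listing.a["href"] for listing in listings]

# Any listing that has the sidebar as an ancestor has leaked out of scope.
leaked = [l for l in listings if l.find_parent(id="main-aside") is not None]

print(len(listings))  # 2
print(hrefs)          # ['/organic-1', '/organic-2']
print(leaked)         # []
```

Running the same find_parent check against the real page is one way to see whether a given parser has pulled the featured listings into the organic div.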
