Error when web scraping with BeautifulSoup

Time: 2018-10-05 05:16:18

Tags: python web-scraping beautifulsoup

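I am collecting housing data from Zillow's website. So far I have collected the data from the first page. As the next step, I am trying to find the link behind the Next button, which would take me to page 2, page 3, and so on. I used Chrome's Inspect feature to locate the Next button, which has the following structure: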

<a href="/homes/recently_sold/house_type/47164_rid/0_singlestory/37.720288,-121.859322,37.601788,-121.918888_rect/12_zm/2_p/" class="on" onclick="SearchMain.changePage(2);return false;" id="yui_3_18_1_1_1525048531062_27962">Next</a>

I then used Beautiful Soup's find_all method, filtering on the tag "a" and the class "on". I extracted all the links with the following code:

  

from bs4 import BeautifulSoup
from selenium import webdriver

# chromedriver holds the path to the ChromeDriver executable
driver = webdriver.Chrome(chromedriver)
zillow_bellevue_1 = "https://www.zillow.com/homes/Bellevue-WA-98004_rb/"
driver.get(zillow_bellevue_1)
soup = BeautifulSoup(driver.page_source, 'html.parser')

next_button = soup.find_all("a", class_="on")
print(next_button)

I am not getting any output. Where am I going wrong?

1 answer:

Answer 0: (score: 0)

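The Next button's class appears to be off rather than on, so you can scrape the details of each property and walk through all the pages as follows. It uses the requests library to fetch the HTML, which should be faster than using the Chrome driver.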

from bs4 import BeautifulSoup
import requests

base_url = "https://www.zillow.com"
url = base_url + "/homes/Bellevue-WA-98004_rb/"

# Send a browser-like User-Agent with each request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}

while url:
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    print('\n' + url)

    # Print the text fragments of each property card
    for div in soup.find_all('div', class_="zsg-photo-card-caption"):
        print("  {}".format(list(div.stripped_strings)))

    # Follow the next-page link; stop the loop when there is none
    next_button = soup.find("a", class_="off", href=True)
    url = base_url + next_button['href'] if next_button else None

This keeps requesting URLs until no Next button is found. The output has the following form:

https://www.zillow.com/homes/Bellevue-WA-98004_rb/
  ['New Construction', '$2,224,995+', '5 bds', '·', '4 ba', '·', '3,796+ sqft', 'The Castille Plan, Verano', 'D.R. Horton - Seattle']
  ['12 Central Square', '2', '$2,550+', '10290 NE 12th St, Bellevue, WA']
  ['Apartment For Rent', '$1,800/mo', '1 bd', '·', '1 ba', '·', '812 sqft', '10423 NE 32nd Pl APT E105, Bellevue, WA']
  ['House For Sale', '$1,898,000', '5 bds', '·', '4 ba', '·', '4,030 sqft', '3230 108th Ave SE, Bellevue, WA', 'Quorum Real Estate/Madison Inc']
  ['New Construction', '-- bds', '·', '-- ba', '·', '-- sqft', 'Coming Soon Plan, Northtowne', 'D.R. Horton - Seattle']
  ['The Meyden', '0', '$1,661+', '1', '$2,052+', '2', '$3,240+', '10333 Main St, Bellevue, WA']
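
The next-link lookup can also be tried offline before pointing it at the live site. The following is only a minimal sketch: SAMPLE_HTML is made-up markup loosely modeled on the anchor quoted in the question, and find_next_url and anchor_classes are hypothetical helper names, not part of Beautiful Soup. It uses urljoin in place of string concatenation to build the absolute URL, and the second helper lists the classes that actually occur on a tags, which may help confirm whether the downloaded HTML uses on or off.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real Zillow page may differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/homes/example/1_p/" class="on">1</a></li>
  <li><a href="/homes/example/2_p/" class="off">Next</a></li>
</ul>
"""


def find_next_url(html, base_url, css_class="off"):
    """Return the absolute URL of the first <a> with css_class and an href, or None."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", class_=css_class, href=True)
    return urljoin(base_url, link["href"]) if link else None


def anchor_classes(html):
    """Return the sorted set of class names that appear on <a> tags."""
    soup = BeautifulSoup(html, "html.parser")
    return sorted({c for a in soup.find_all("a", class_=True) for c in a["class"]})


if __name__ == "__main__":
    print(find_next_url(SAMPLE_HTML, "https://www.zillow.com"))
    print(anchor_classes(SAMPLE_HTML))
```

If anchor_classes run on the fetched page does not include the class being filtered on, find_all will return an empty list, which would be consistent with what the question describes.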