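I am collecting housing data from Zillow's website. So far I have collected the data from the first results page. As the next step, I am trying to find the link behind the Next button, which navigates to page 2, page 3, and so on. I used Chrome's Inspect feature to locate the Next button, which has the following structure: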
<a href="/homes/recently_sold/house_type/47164_rid/0_singlestory/37.720288,-121.859322,37.601788,-121.918888_rect/12_zm/2_p/" class="on" onclick="SearchMain.changePage(2);return false;" id="yui_3_18_1_1_1525048531062_27962">Next</a>
I then used Beautiful Soup's find_all method, filtering on the tag "a" and the class "on". I used the following code to extract all the links:
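from bs4 import BeautifulSoup
from selenium import webdriver

# chromedriver holds the path to the ChromeDriver executable (defined elsewhere)
driver = webdriver.Chrome(chromedriver)
zillow_bellevue_1 = "https://www.zillow.com/homes/Bellevue-WA-98004_rb/"
driver.get(zillow_bellevue_1)
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Look for the Next link: <a> tags with class "on"
next_button = soup.find_all("a", class_="on")
print(next_button)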
I did not get any output. Where am I going wrong?
Answer 0 (score: 0)
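The class of the Next button appears to be off rather than on, so you can scrape the details of each property and walk through all the pages as follows. It uses the requests library to fetch the HTML, which should be faster than using the Chrome driver.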
from bs4 import BeautifulSoup
import requests

base_url = "https://www.zillow.com"
url = base_url + "/homes/Bellevue-WA-98004_rb/"
# Send a browser-like User-Agent with every request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}

while url:
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    print('\n' + url)

    # Each listing caption holds the price, beds, baths, area and address
    for div in soup.find_all('div', class_="zsg-photo-card-caption"):
        print(" {}".format(list(div.stripped_strings)))

    # Follow the Next link; stop once the page no longer has one
    next_button = soup.find("a", class_="off", href=True)
    url = base_url + next_button['href'] if next_button else None
This keeps requesting URLs until no Next button is found. The output has the following form:
https://www.zillow.com/homes/Bellevue-WA-98004_rb/
['New Construction', '$2,224,995+', '5 bds', '·', '4 ba', '·', '3,796+ sqft', 'The Castille Plan, Verano', 'D.R. Horton - Seattle']
['12 Central Square', '2', '$2,550+', '10290 NE 12th St, Bellevue, WA']
['Apartment For Rent', '$1,800/mo', '1 bd', '·', '1 ba', '·', '812 sqft', '10423 NE 32nd Pl APT E105, Bellevue, WA']
['House For Sale', '$1,898,000', '5 bds', '·', '4 ba', '·', '4,030 sqft', '3230 108th Ave SE, Bellevue, WA', 'Quorum Real Estate/Madison Inc']
['New Construction', '-- bds', '·', '-- ba', '·', '-- sqft', 'Coming Soon Plan, Northtowne', 'D.R. Horton - Seattle']
['The Meyden', '0', '$1,661+', '1', '$2,052+', '2', '$3,240+', '10333 Main St, Bellevue, WA']
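The follow-the-Next-link loop above can also be tried offline. The following is only a minimal sketch, not the answer's code: it swaps Beautiful Soup for the standard-library html.parser so that it has no dependencies, and it assumes the Next link is an <a> tag with class "off", as observed above. The names NextLinkFinder, find_next_url, crawl, and FAKE_PAGES, along with the example.com URLs, are hypothetical and exist only for illustration. It also adds two guards that the original loop does not have: it stops if a link points back to a page already visited, and it caps the number of pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> tag carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes and attr_map.get("href"):
            self.href = attr_map["href"]


def find_next_url(page_url, html, css_class="off"):
    """Return the absolute URL of the next page, or None if no link matches."""
    finder = NextLinkFinder(css_class)
    finder.feed(html)
    return urljoin(page_url, finder.href) if finder.href else None


def crawl(start_url, fetch, max_pages=50):
    """Follow Next links from start_url; fetch(url) must return HTML text."""
    visited = []
    url = start_url
    # Stop on a missing link, a repeated page, or the page cap
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        url = find_next_url(url, fetch(url))
    return visited


# Hypothetical two-page site used only to exercise the loop offline
FAKE_PAGES = {
    "https://example.com/homes/": '<a href="/homes/2_p/" class="off">Next</a>',
    "https://example.com/homes/2_p/": "<p>No more results</p>",
}

pages = crawl("https://example.com/homes/", FAKE_PAGES.__getitem__)
print(pages)
```

Against a live site, fetch could be something like lambda u: requests.get(u, headers=headers).text. Using urljoin rather than string concatenation means the loop handles both relative and absolute href values.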