I'm trying to use BeautifulSoup to retrieve data for the Mercedes C-Class from CarGurus, for example:
import requests

# Search-results page (adjacent string literals are joined into one URL)
url1 = ('https://www.cargurus.com/Cars/inventorylisting/viewDetailsFilterViewInventoryListing.action?'
        '&showNegotiable=true&sourceContext=carGurusHomePageModel'
        '&entitySelectingHelper.selectedEntity2=c21239'
        '&entitySelectingHelper.selectedEntity=c6079')
# Same page, plus the fragment that points at the first listing
url2 = ('https://www.cargurus.com/Cars/inventorylisting/viewDetailsFilterViewInventoryListing.action?'
        '&showNegotiable=true&sourceContext=carGurusHomePageModel'
        '&entitySelectingHelper.selectedEntity2=c21239'
        '&entitySelectingHelper.selectedEntity=c6079#listing=260322671_isFeatured')
response1 = requests.get(url1)
response2 = requests.get(url2)
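Note that url2 is the link to the first item shown on the url1 page (it carries the suffix #listing=260322671_isFeatured), and that is the item I want to scrape a lot of details from.
However, response1.content and response2.content are exactly the same.
I tried different pages and different car models, but I ran into the same problem every time I used bs4.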
By the way, I'm on a MacBook, and I have read about using WebDriver on macOS, for example:
driver = webdriver.Safari()
driver.get(URL)
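Only this way can I reach a specific item page, but the session gets locked, which means I can't use a loop to visit multiple pages one after another... so I went back to bs4. Any ideas?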
Answer 0 (score: 0)
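One likely reason response1.content and response2.content come back identical: everything after # in a URL is a fragment, which HTTP clients keep to themselves and do not send to the server. A minimal sketch with the standard library's urllib.parse.urldefrag shows that the two URLs from the question reduce to the same request target (the helper name request_target is made up for this illustration):

```python
from urllib.parse import urldefrag

BASE = (
    "https://www.cargurus.com/Cars/inventorylisting/"
    "viewDetailsFilterViewInventoryListing.action?"
    "&showNegotiable=true&sourceContext=carGurusHomePageModel"
    "&entitySelectingHelper.selectedEntity2=c21239"
    "&entitySelectingHelper.selectedEntity=c6079"
)
url1 = BASE
url2 = BASE + "#listing=260322671_isFeatured"


def request_target(url):
    """Hypothetical helper: drop the fragment, which is not sent over HTTP."""
    return urldefrag(url).url


# Both URLs point at the same server-side resource once the fragment is removed.
print(request_target(url1) == request_target(url2))
# The fragment itself is only available on the client side.
print(urldefrag(url2).fragment)
```

So the listing suffix is only meaningful to the JavaScript running in the browser, which is why the details have to be fetched some other way.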
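The data is loaded dynamically via Ajax/JSON. However, by inspecting where the page makes its connections (for example, in the browser's network tab), we can simulate those requests with requests: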
import json
import requests
from bs4 import BeautifulSoup

# Search-results page (adjacent string literals are joined, so the URL contains no newlines)
url = ('https://www.cargurus.com/Cars/inventorylisting/viewDetailsFilterViewInventoryListing.action?'
       '&showNegotiable=true&sourceContext=carGurusHomePageModel'
       '&entitySelectingHelper.selectedEntity2=c21239'
       '&entitySelectingHelper.selectedEntity=c6079')
# JSON endpoint the page calls for each listing; {} is filled with the listing id
listing_detail_url = 'https://www.cargurus.com/Cars/detailListingJson.action?inventoryListing={}&searchZip=&searchDistance=100&inclusionType=DEFAULT'

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

data = []
for a in soup.select('a[href^="#listing"]'):  # get all listings on the page
    listing_id = a['href'].split('=')[-1]
    json_data = requests.get(listing_detail_url.format(listing_id)).json()
    # print(json.dumps(json_data, indent=4))  # <-- uncomment this to print all data
    listing_title = json_data['listing']['listingTitle']
    price = json_data['listing']['price']
    make_name = json_data['listing']['makeName']
    model_name = json_data['listing']['modelName']
    # ... other data
    data.append((listing_title, price, make_name, model_name))

# print the data
print('{:<80} {:<30} {:<30} {:<30}'.format('Title', 'Price', 'Brand', 'Model'))
for row in data:
    print('{:<80} {:<30} {:<30} {:<30}'.format(*row))
Prints:
Title                                                                            Price                          Brand                          Model
2009 Mercedes-Benz C-Class C 300 Sport - $8,000                                  8000.0                         Mercedes-Benz                  C-Class
2008 Mercedes-Benz C-Class C 300 Luxury - $3,500                                 3500.0                         Mercedes-Benz                  C-Class
2009 Mercedes-Benz C-Class C 300 Sport - $5,999                                  5999.0                         Mercedes-Benz                  C-Class
2007 Mercedes-Benz C-Class C 280 4MATIC Luxury AWD - $1,975                      1975.0                         Mercedes-Benz                  C-Class
2007 Mercedes-Benz C-Class C 230 Sport - $2,499                                  2499.0                         Mercedes-Benz                  C-Class
2009 Mercedes-Benz C-Class - $5,299                                              5299.0                         Mercedes-Benz                  C-Class
2009 Mercedes-Benz C-Class C 300 4MATIC Luxury - $6,499                          6499.0                         Mercedes-Benz                  C-Class
2008 Mercedes-Benz C-Class C 300 Luxury - $5,950                                 5950.0                         Mercedes-Benz                  C-Class
2008 Mercedes-Benz C-Class C 300 Luxury 4MATIC - $6,650                          6650.0                         Mercedes-Benz                  C-Class
2005 Mercedes-Benz C-Class C 230 Kompressor Supercharged Sedan - $2,995          2995.0                         Mercedes-Benz                  C-Class
2007 Mercedes-Benz C-Class C 230 Sport - $4,900                                  4900.0                         Mercedes-Benz                  C-Class
2008 Mercedes-Benz C-Class C 350 Sport - $6,400                                  6400.0                         Mercedes-Benz                  C-Class
2009 Mercedes-Benz C-Class C 300 4MATIC Luxury - $6,900                          6900.0                         Mercedes-Benz                  C-Class
2008 Mercedes-Benz C-Class C 300 Sport - $6,200                                  6200.0                         Mercedes-Benz                  C-Class
2007 Mercedes-Benz C-Class C 280 4MATIC Luxury AWD - $3,830                      3830.0                         Mercedes-Benz                  C-Class
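If a listing happens to lack one of these fields, the direct json_data['listing'][...] lookups would raise KeyError. Here is a rough sketch of a more defensive variant. Both listing_id_from_href and extract_row are hypothetical helpers, and the sample dictionary is made up to mirror only the keys read above, not a real CarGurus response:

```python
def listing_id_from_href(href):
    """Hypothetical helper: '#listing=123_isFeatured' -> '123_isFeatured'."""
    return href.split('=')[-1]


def extract_row(json_data):
    """Hypothetical helper: return the four fields, using None where one is missing."""
    listing = json_data.get('listing') or {}
    return (
        listing.get('listingTitle'),
        listing.get('price'),
        listing.get('makeName'),
        listing.get('modelName'),
    )


# Made-up sample shaped like the keys the answer reads; not real site data.
sample = {
    'listing': {
        'listingTitle': 'Example title',
        'price': 8000.0,
        'makeName': 'Mercedes-Benz',
        'modelName': 'C-Class',
    }
}

print(listing_id_from_href('#listing=260322671_isFeatured'))
print(extract_row(sample))
print(extract_row({}))  # every field falls back to None
```

Inside the loop, data.append(extract_row(json_data)) could stand in for the four separate lookups. Note that '{:<80}'.format(None) raises TypeError, so missing values would need to be replaced (for example with an empty string) before printing the table.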