I am trying to scrape the links from a page that uses infinite scroll. I can only extract the links on the first pane. How do I proceed so that I end up with a complete list of all the links? This is what I have so far:
from bs4 import BeautifulSoup
import requests
html = "https://www.carwale.com/used/cars-for-sale/#sc=-1&so=-1&car=7&pn=8&lcr=168&ldr=0&lir=0"
html_content = requests.get(html).text
soup = BeautifulSoup(html_content, "lxml")
table = soup.find_all("div", {"class": "card-detail-block__data"})
y = []
for i in table:
    try:
        y.append(i.find("a", {"id": "linkToDetails"}).get('href'))
    except AttributeError:
        pass
z = [('carwale.com' + item) for item in y]
z
Answer 0 (Score: 1)
You don't need BeautifulSoup at all to wrangle the HTML DOM, because the site serves the JSON responses that populate the HTML. Requests alone can do the job. If you monitor the Network tab in the Chrome or Firefox dev tools, you'll see that on every scroll load the browser sends a GET request to an API. Using that, we can get clean JSON data.
Disclaimer: I have not checked whether this site allows web scraping. Please double-check its terms of use; I'm assuming you have done so.
I used pandas to make it easier to work with the tabular data and to export it to CSV or any other format you prefer: pip install pandas
import pandas as pd
from requests import Session
# Using Session and a header
req = Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/75.0.3770.80 Safari/537.36',
           'Content-Type': 'application/json;charset=UTF-8'}
# Add headers
req.headers.update(headers)
BASE_URL = 'https://www.carwale.com/webapi/classified/stockfilters/'
# Monitoring the updates on Network, the params changes in each load
#sc=-1&so=-1&car=7&pn=1
#sc=-1&so=-1&car=7&pn=2&lcr=24&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=3&lcr=48&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=4&lcr=72&ldr=0&lir=0
params = dict(sc=-1, so=-1, car=7, pn=4, lcr=72, ldr=0, lir=0)
r = req.get(BASE_URL, params=params) #just like requests.get
# Check if everything is okay
assert r.ok, 'We did not get 200'
# get json data
data = r.json()
# Put it in DataFrame
df = pd.DataFrame(data['ResultData'])
print(df.head())
# to go to another page create a function:
def scrap_carwale(params):
    r = req.get(BASE_URL, params=params)
    if not r.ok:
        raise ConnectionError('We did not get 200')
    data = r.json()
    return pd.DataFrame(data['ResultData'])
# Just first 5 pages :)
for i in range(5):
    params['pn'] += 1
    params['lcr'] = 24 * (params['pn'] - 1)  # lcr grows by 24 per page in the observed requests
    dt = scrap_carwale(params)
    # append this page's rows to the running DataFrame
    df = pd.concat([df, dt], ignore_index=True)
#print data sample
print(df.sample(10))
# Save data to csv or whatever format
df.to_csv('my_data.csv') #see df.to_?
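If you want to keep going until there are no more pages rather than stopping at a fixed count, here is a minimal sketch along the same lines. It assumes (unverified) that the API simply returns an empty ResultData list once the pages run out:
# Walk the pages from the start until the API returns no rows
all_pages = []
params = dict(sc=-1, so=-1, car=7, pn=1, ldr=0, lir=0)
while True:
    page_df = scrap_carwale(params)
    if page_df.empty:
        break
    all_pages.append(page_df)
    params['pn'] += 1
    params['lcr'] = 24 * (params['pn'] - 1)  # same lcr pattern as seen in the Network tab
full_df = pd.concat(all_pages, ignore_index=True)
print(full_df.shape)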
Answer 1 (Score: 0)
Try this:
next_page = soup.find('a', rel='next', href=True)
if next_page:
    next_html_content = requests.get(next_page['href']).text
The next-page URL is hidden in the site's source. You can find it by searching for the rel="next" tag in your browser.
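Putting that together with the extraction code from the question, a minimal sketch of following rel="next" links until none are left might look like this. I haven't verified that this site actually renders a rel="next" anchor server-side, and I'm assuming the hrefs may be relative, hence the urljoin calls:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

url = "https://www.carwale.com/used/cars-for-sale/"
links = []
while url:
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    # Collect the detail links on the current page
    for card in soup.find_all("div", {"class": "card-detail-block__data"}):
        a = card.find("a", {"id": "linkToDetails"})
        if a and a.get("href"):
            links.append(urljoin("https://www.carwale.com", a["href"]))
    # Follow the rel="next" link if one exists, otherwise stop
    next_page = soup.find("a", rel="next", href=True)
    url = urljoin(url, next_page["href"]) if next_page else None
print(len(links))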