Scraping content from an infinite scroll website

Date: 2020-02-15 09:52:35

Tags: selenium web-scraping beautifulsoup scrapy

I am trying to scrape the links from a web page that uses infinite scroll. I can only extract the links on the first pane. How do I continue so that I end up with a complete list of all the links? This is what I have so far:


from bs4 import BeautifulSoup
import requests

html = "https://www.carwale.com/used/cars-for-sale/#sc=-1&so=-1&car=7&pn=8&lcr=168&ldr=0&lir=0"
html_content = requests.get(html).text
soup = BeautifulSoup(html_content, "lxml")
table = soup.find_all("div", {"class": "card-detail-block__data"})

y = []
for i in table:
    try:
        y.append(i.find("a", {"id":"linkToDetails"}).get('href'))
    except AttributeError:
        pass

z = [('carwale.com' + item) for item in y]
z

2 Answers:

Answer 0 (score: 1)

You don't need BeautifulSoup at all to pick apart the HTML DOM, because the site serves the JSON response that populates the HTML. Requests alone can do the job. If you monitor the Network tab in the Chrome or Firefox developer tools, you will see that on every load the browser sends a GET request to an API. Using that, we can get clean JSON data.

Disclaimer: I have not checked whether this site allows web scraping. Please double-check its terms of use; I assume you have done so.

I used Pandas to help work with the tabular data; it also lets you export the data to CSV or any other format you like: pip install pandas

import pandas as pd
from requests import Session

# Using Session and a header
req = Session() 
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
                         'AppleWebKit/537.36 (KHTML, like Gecko) '\
                         'Chrome/75.0.3770.80 Safari/537.36',
          'Content-Type': 'application/json;charset=UTF-8'}
# Add headers
req.headers.update(headers)

BASE_URL = 'https://www.carwale.com/webapi/classified/stockfilters/'

# Monitoring the Network tab, the params change on each load
#sc=-1&so=-1&car=7&pn=1
#sc=-1&so=-1&car=7&pn=2&lcr=24&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=3&lcr=48&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=4&lcr=72&ldr=0&lir=0

params = dict(sc=-1, so=-1, car=7, pn=4, lcr=72, ldr=0, lir=0)

r = req.get(BASE_URL, params=params) #just like requests.get

# Check if everything is okay
assert r.ok, 'We did not get 200'

# get json data
data = r.json()

# Put it in DataFrame
df = pd.DataFrame(data['ResultData'])

print(df.head())

# To fetch other pages, create a function:

def scrap_carwale(params):
    r = req.get(BASE_URL, params=params)
    if not r.ok:
        raise ConnectionError('We did not get 200')
    data = r.json()

    return  pd.DataFrame(data['ResultData'])


# Just first 5 pages :)    
for i in range(5):
    params['pn'] += 1
    params['lcr'] += 24  # follows the pattern observed in the Network tab (24 results per page)

    dt = scrap_carwale(params)
    # append the new page to the existing data
    # (df.append was removed in pandas 2.0; use pd.concat([df, dt]) on newer versions)
    df = df.append(dt)

#print data sample
print(df.sample(10))

# Save data to csv or whatever format
df.to_csv('my_data.csv') #see df.to_?
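
If what you ultimately need is the list of detail-page links from the question, you can pull them straight out of the DataFrame once you know which column holds them. This is only a sketch: the column name 'url' below is an assumption, so print df.columns first to see what the API actually returns.

# Inspect the columns the API returns; 'url' below is only an assumed name
print(df.columns)

links = ['https://www.carwale.com' + u for u in df['url'].dropna()]
print(len(links), links[:5])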

Here is the Network tab: (screenshot)

Response: (screenshot)

Sample of the results: (screenshot)

Answer 1 (score: 0)

Try this:

next_page = soup.find('a', rel='next', href=True)

if next_page:
    next_html_content = requests.get(next_page['href']).text

The URL of the next page is hidden in the page source. You can find it by searching for the rel="next" tag in your browser's view-source.
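
Building on that, here is a minimal sketch of a loop that keeps following the rel="next" link until none is left, collecting the same detail links as in the question. It assumes the href can be relative (hence urljoin) and that the site actually renders a rel="next" anchor in the static HTML, which you should verify in the page source.

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://www.carwale.com/used/cars-for-sale/"
all_links = []

while url:
    soup = BeautifulSoup(requests.get(url).text, "lxml")

    # Collect the detail links on the current page (same selectors as in the question)
    for block in soup.find_all("div", {"class": "card-detail-block__data"}):
        link = block.find("a", {"id": "linkToDetails"})
        if link and link.get("href"):
            all_links.append(urljoin(url, link["href"]))

    # Follow the rel="next" link, if the page exposes one
    next_page = soup.find("a", rel="next", href=True)
    url = urljoin(url, next_page["href"]) if next_page else None

print(len(all_links))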