I am trying to scrape an e-commerce website whose next pages are loaded via ajax calls.
I am able to scrape the data on page 1, but when I scroll page 1 to the bottom, page 2 is loaded automatically through an ajax call.
My code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq

my_url = 'http://www.shopclues.com/mobiles-smartphones.html'
page = ureq(my_url).read()
page_soup = soup(page, "html.parser")
containers = page_soup.findAll("div", {"class": "column col3"})
for container in containers:
    name = container.h3.text
    price = container.find("span", {'class': 'p_price'}).text
    print("Name : " + name.replace(",", " "))
    print("Price : " + price)

for i in range(2, 7):
    my_url = "http://www.shopclues.com/ajaxCall/moreProducts?catId=1431&filters=&pageType=c&brandName=&start=" + str(36 * (i - 1)) + "&columns=4&fl_cal=1&page=" + str(i)
    page = ureq(my_url).read()
    print(page)
    page_soup = soup(page, "html.parser")
    containers = page_soup.findAll("div", {"class": "column col3"})
    for container in containers:
        name = container.h3.text
        price = container.find("span", {'class': 'p_price'}).text
        print("Name : " + name.replace(",", " "))
        print("Price : " + price)
I printed the ajax page read by ureq to check whether I was able to open the ajax page and get any output:
b'' is the output of print(page).
Please provide me with a solution to scrape the remaining data.
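As an aside, the long ajax URL in the loop above can be assembled more readably with urllib.parse.urlencode. This is only a minimal sketch of building the same URLs; the parameter names and the 36-products-per-page step are taken directly from the URL in the question, not verified against the site:

```python
from urllib.parse import urlencode

BASE = "http://www.shopclues.com/ajaxCall/moreProducts"

# Build the same paginated URLs as the string concatenation above.
for i in range(2, 7):
    params = {
        "catId": 1431,
        "filters": "",
        "pageType": "c",
        "brandName": "",
        "start": 36 * (i - 1),  # offset: 36 products per page (per the question's URL)
        "columns": 4,
        "fl_cal": 1,
        "page": i,
    }
    print(BASE + "?" + urlencode(params))
```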
Answer 0 (score: 2)
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import random
import time

chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications": 2}
chrome_options.add_experimental_option("prefs", prefs)

# A randomizer for the delay
seconds = 5 + (random.random() * 5)

# create a new Chrome session
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.implicitly_wait(30)
# driver.maximize_window()

# navigate to the application home page
driver.get("http://www.shopclues.com/mobiles-smartphones.html")
time.sleep(seconds)
time.sleep(seconds)

# Add more to range for more phones
for i in range(1):
    element = driver.find_element_by_id("moreProduct")
    driver.execute_script("arguments[0].click();", element)
    time.sleep(seconds)
    time.sleep(seconds)

html = driver.page_source
page_soup = soup(html, "html.parser")
containers = page_soup.findAll("div", {"class": "column col3"})
for container in containers:
    # Add error handling
    try:
        name = container.h3.text
        price = container.find("span", {'class': 'p_price'}).text
        print("Name : " + name.replace(",", " "))
        print("Price : " + price)
    except AttributeError:
        continue

driver.quit()
I use selenium to load the website and click the button that loads more results, then take the resulting html and feed it into your code.
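The parsing step the answer hands back to the question's code can be exercised offline, without selenium or a live page. This is a minimal sketch on an inline HTML fragment whose markup shape is assumed from the question's selectors (a "column col3" div containing an h3 and a "p_price" span); the fragment itself is hypothetical, not real site markup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the structure the selectors expect.
html = """
<div class="column col3"><h3>Phone A</h3><span class="p_price">Rs.4,999</span></div>
<div class="column col3"><h3>Phone B</h3></div>
"""

page_soup = BeautifulSoup(html, "html.parser")
for container in page_soup.find_all("div", {"class": "column col3"}):
    try:
        name = container.h3.text
        price = container.find("span", {"class": "p_price"}).text
        print("Name : " + name.replace(",", " "))
        print("Price : " + price)
    except AttributeError:
        # containers without a price span are skipped, as in the answer
        continue
```

The try/except mirrors the answer's error handling: the second container has no price span, so `.find(...)` returns None and accessing `.text` raises AttributeError, and that product is skipped instead of crashing the loop.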