Hi everyone, I have built a script in Python that uses Selenium to scroll an infinite-scroll page and then click the "Load More" button, but it apparently only gives me about half of the products, and it is also very time-consuming. It saves all the product links to a CSV file. The script I wrote is:
from selenium import webdriver
import time
from selenium.common.exceptions import WebDriverException
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoSuchWindowException
path_to_chromedriver = 'C:/Users/Admin/AppData/Local/Programs/Python/Python37-32/chromedriver.exe'
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications": 2}
chrome_options.add_experimental_option("prefs", prefs)
chrome_options.add_argument("start-maximized")
browser = webdriver.Chrome(options=chrome_options, executable_path=path_to_chromedriver)
with open('E:/grainger2.txt', 'r', encoding='utf-8-sig') as f:
    content = f.readlines()
content = [x.strip() for x in content]

with open('E:/grainger11.csv', 'a', encoding="utf-8") as f:
    headers = "type,link,sublink"  # three columns to match the rows written below
    f.write(headers)
    f.write("\n")
    for dotnum in content:
        browser.get(dotnum)
        SCROLL_PAUSE_TIME = 1
        # Get scroll height
        last_height = browser.execute_script("return document.body.scrollHeight")
        while True:
            # Scroll down to bottom
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            # Wait to load page
            time.sleep(SCROLL_PAUSE_TIME)
            # Calculate new scroll height and compare with last scroll height
            new_height = browser.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height
        # Keep clicking "Load More" until the button is no longer found
        while True:
            try:
                try:
                    loadMoreButton = browser.find_element_by_css_selector(".btn.list-view__load-more.list-view__load-more--js")
                    loadMoreButton.click()
                    time.sleep(2)
                except NoSuchWindowException:
                    pass
            except Exception:
                break
        # Collect the product links rendered so far
        try:
            try:
                for links in browser.find_elements_by_css_selector(".list-view__product.list-view__product--js"):
                    aa = links.get_attribute("data-url-ie8")
                    print(aa)
                    ana = "loadlink"
                    f.write(ana + "," + dotnum + "," + aa + "\n")
            except NoSuchWindowException:
                pass
        except NoSuchElementException:
            pass
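If you stay with Selenium, one small improvement worth making: write the rows with the csv module rather than manual string concatenation (commas inside a URL would otherwise break the file), and deduplicate sublinks, since each "Load More" pass can re-render items already collected. A minimal sketch of that idea (the URLs and the write_links helper are placeholders, not part of the original script):

```python
import csv
import io

def write_links(rows):
    """Write (type, link, sublink) rows as proper CSV, skipping duplicate sublinks."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["type", "link", "sublink"])
    seen = set()
    for row in rows:
        if row[2] not in seen:     # dedupe on the product sublink
            seen.add(row[2])
            writer.writerow(row)
    return buf.getvalue()

out = write_links([
    ("loadlink", "https://example.com/category", "https://example.com/p/1"),
    ("loadlink", "https://example.com/category", "https://example.com/p/1"),  # duplicate
    ("loadlink", "https://example.com/category", "https://example.com/p/2"),
])
print(out)
```

In the real script you would pass a csv.writer the open file handle instead of a StringIO buffer.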
With the above script I only get 200 product links, but the page contains 9,748 products. I want to extract all of the links; I would appreciate it if anyone could help me.
Answer 0 (score: 0)

I think you are making this harder than it needs to be.

I would suggest using Scrapy on its own (no Selenium required), and iterating over all the pages using the hidden page link on the page. Look at the source...
You will be able to write the pagination in Scrapy from markup like this...
<section class="searchControls paginator-control">
<a
href="/category/drill-bushings/machine-tool-accessories/machining/ecatalog/N-hg1?searchRedirect=products&requestedPage=2"
class="btn list-view__load-more list-view__load-more--js"
data-current-page="1"
data-product-offset="32"
data-total-products="9749"
data-page-url="/category/drill-bushings/machine-tool-accessories/machining/ecatalog/N-hg1?searchRedirect=products"
id="list-view__load-more--js">
View More
</a>
</section>
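Based on the data-* attributes in that anchor (data-current-page, data-product-offset, data-total-products), the pagination can be walked by incrementing requestedPage until the offset covers the total. A minimal sketch of that calculation, independent of Scrapy (the function name and the page-size constant are my own, inferred from the snippet above):

```python
from urllib.parse import urlencode

PAGE_SIZE = 32  # data-product-offset is 32 after page 1, so assume 32 products per page

def next_page_url(page_url, current_page, total_products, page_size=PAGE_SIZE):
    """Return the URL of the next page, or None once every product is covered."""
    if current_page * page_size >= total_products:
        return None
    params = urlencode({"searchRedirect": "products",
                        "requestedPage": current_page + 1})
    return page_url.split("?")[0] + "?" + params

base = "/category/drill-bushings/machine-tool-accessories/machining/ecatalog/N-hg1?searchRedirect=products"
print(next_page_url(base, 1, 9749))
```

In a Scrapy spider you would yield a Request for this URL from parse() and stop when the helper returns None; alternatively, just follow the anchor's href directly with response.follow.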
I'd suggest you rewrite this code and use this pagination block to get the same results; it will lead to a much less complex solution.
For a basic example, see the Scrapy information on following links.