Question

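I have been doing some research, and so far I have found the Python package Scrapy, which I plan to use. Now I am trying to figure out a good way to build a scraper with Scrapy that can crawl a website with infinite scrolling. After digging around, I found a package called Selenium, and it has a Python module. I have a feeling that someone has already used Scrapy and Selenium to scrape a site with infinite scroll. It would be great if someone could point me to an example.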

Answer 1

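from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://www.something.com")
# The find_elements_by_* helpers are gone from recent Selenium 4 releases;
# find_elements(By.ID, ...) is the current form.
last_element = driver.find_elements(By.ID, "someId")[-1]
# Sending a no-op key makes the browser scroll the element into view.
last_element.send_keys(Keys.NULL)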

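This opens a page, finds the bottom-most element with the given id, and scrolls that element into view. As the page loads more content, you have to keep querying the driver for the last element, and I found that this gets slow as the page grows. The time is dominated by the calls to driver.find_element_*, because I don't know of a way to explicitly query for the last element in the page.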
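One possible way to ask for only the last match, rather than fetching every match and indexing with [-1], is an XPath expression that ends in last(). The sketch below is not from the original answer: the helper names are made up for illustration, it assumes the id contains no single quotes, and it scrolls with scrollIntoView() instead of the Keys.NULL trick.

```python
def last_element_xpath(element_id):
    """Build an XPath that matches only the last element (in document order) with this id."""
    # Assumes element_id contains no single quotes.
    return "(//*[@id='{}'])[last()]".format(element_id)


def scroll_last_into_view(driver, element_id):
    """Locate the last matching element and scroll it into view."""
    # "xpath" is the string value of selenium.webdriver.common.by.By.XPATH.
    element = driver.find_element("xpath", last_element_xpath(element_id))
    driver.execute_script("arguments[0].scrollIntoView();", element)
    return element
```

With a real driver this would be called as scroll_last_into_view(driver, "someId") inside the loop that waits for more content. Whether it is actually faster than find_elements on a large page would need to be measured.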

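Through experimentation you may find that there is an upper limit on the number of elements the page loads dynamically. If so, it would be best to write something that loads that many elements and only then calls driver.find_element_*.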
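A minimal sketch of that idea, under some assumptions: the elements can be matched by a CSS selector, the page loads more of them when scrolled to the bottom, and counting them inside the browser with querySelectorAll is cheap enough to do on every round. The function name, the default pause, and the max_scrolls cap are illustrative choices, not part of the original answer.

```python
import time

COUNT_SCRIPT = "return document.querySelectorAll(arguments[0]).length;"
SCROLL_SCRIPT = "window.scrollTo(0, document.body.scrollHeight);"


def load_until_count(driver, css_selector, target, pause=2.0,
                     max_scrolls=50, sleep=time.sleep):
    """Scroll until at least `target` elements match, or max_scrolls is reached.

    Returns the number of matching elements seen at the end.
    """
    count = driver.execute_script(COUNT_SCRIPT, css_selector)
    scrolls = 0
    while count < target and scrolls < max_scrolls:
        driver.execute_script(SCROLL_SCRIPT)
        sleep(pause)  # give the page time to load the next batch
        count = driver.execute_script(COUNT_SCRIPT, css_selector)
        scrolls += 1
    return count
```

Once this returns, a single find_elements call can collect everything in one pass instead of one slow lookup per scroll.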

Answer 2

You can use Selenium to scrape infinite-scroll websites such as Twitter or Facebook.

Step 1: Install Selenium using pip

pip install selenium

Step 2: Use the code below to automate the infinite scroll and extract the page source

import time
import unittest

from selenium import webdriver
from selenium.webdriver.common.by import By

class Sel(unittest.TestCase):
    def setUp(self):
        # Start Firefox and wait up to 30 seconds when locating elements
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"

    def tearDown(self):
        # Close the browser when the test finishes
        self.driver.quit()

    def test_sel(self):
        driver = self.driver
        driver.get(self.base_url + "/search?q=stackoverflow&src=typd")
        driver.find_element(By.LINK_TEXT, "All").click()
        # Scroll to the bottom 99 times, pausing 4 seconds for new content to load
        for _ in range(1, 100):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        # Grab the source of the fully loaded page
        html_source = driver.page_source
        data = html_source.encode('utf-8')


if __name__ == "__main__":
    unittest.main()

The for loop scrolls through the infinite-scroll page, after which you can extract the data that has been loaded.

Step 3: Print the data if required.
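As a rough illustration of step 3, the page source can be reduced to the text of the elements you care about before printing. The sketch below uses only the standard library's html.parser. The class name "tweet-text" is a hypothetical placeholder, since the real markup has to be checked in the browser, and the parser assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element whose class attribute contains class_name."""

    # Tags that never have a closing tag, so they must not affect nesting depth
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside a matching element
        self.parts = []     # text pieces of the element being collected
        self.results = []   # finished text blocks

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_text(html_source, class_name):
    parser = ClassTextExtractor(class_name)
    parser.feed(html_source)
    return parser.results
```

With the test above, something like `for text in extract_text(html_source, "tweet-text"): print(text)` would print one block per matched element. Note that it takes the decoded html_source string, not the UTF-8 encoded data.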

Answer 3

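With infinite scrolling, the data is requested through Ajax calls. Open the web browser -> Network tab -> clear the previous request history by clicking the stop-like icon -> scroll the web page -> you can now find the new request triggered by the scroll event -> open the request headers -> there you can find the URL of the request -> copy and paste the URL into a separate tab -> you can see the result of the Ajax call -> then simply build the request URLs yourself to fetch page after page of data until you reach the end.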
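The last step, building the request URL repeatedly until the data runs out, might look like the sketch below. Everything about the endpoint is an assumption here: the "page" query parameter, the JSON response, and the "items" key are placeholders for whatever the Network tab actually shows for the site in question. The fetch function is passed in so the loop can be tried without touching the network.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(base_url, params):
    """Request base_url with the given query parameters and decode the JSON body."""
    with urlopen(base_url + "?" + urlencode(params)) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_items(base_url, fetch=fetch_json, max_pages=100):
    """Yield items page by page until a page comes back empty.

    Assumes a hypothetical endpoint that takes ?page=N and returns
    {"items": [...]}; adjust both to match the real Ajax call.
    """
    page = 1
    while page <= max_pages:
        payload = fetch(base_url, {"page": page})
        items = payload.get("items", [])
        if not items:
            break
        yield from items
        page += 1
```

Many sites use a cursor or offset instead of a page number, in which case the next value usually comes from the previous response rather than a counter. In a Scrapy spider the same idea would be expressed by yielding a new request for the next page from the callback.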

Answer 4

Here is the short code that worked for me:

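import time

from selenium.webdriver.common.by import By

# Assumes `driver` is an existing WebDriver with the target page already loaded.
SCROLL_PAUSE_TIME = 20  # seconds to wait after each scroll

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height;
    # an unchanged height means no more content was loaded
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Collect the loaded posts once scrolling has finished
posts = driver.find_elements(By.CLASS_NAME, "post-text")

for block in posts:
    print(block.text)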

Scraping an infinite-scroll website with Python

4 answers: