Question

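I have written many scrapers, but I am not sure how to handle infinite scrolling. These days most websites, such as Facebook and Pinterest, use infinite scroll.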

Answer 1

You can use Selenium to scrape infinite-scroll websites such as Twitter or Facebook.

Step 1: Install Selenium using pip

pip install selenium

Step 2: Use the code below to automate the infinite scrolling and extract the page source

import time
import unittest

from selenium import webdriver
from selenium.webdriver.common.by import By


class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        # Wait up to 30 seconds for elements to appear before failing a lookup
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"

    def tearDown(self):
        # Close the browser even if the test fails
        self.driver.quit()

    def test_sel(self):
        driver = self.driver
        driver.get(self.base_url + "/search?q=stckoverflow&src=typd")
        # find_element_by_link_text was removed in Selenium 4; use By.LINK_TEXT
        driver.find_element(By.LINK_TEXT, "All").click()
        for i in range(1, 100):
            # Jump to the bottom so the page loads the next batch of results
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            # Give the new content time to load
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode('utf-8')


if __name__ == "__main__":
    unittest.main()

Step 3: Print the data if required.
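A fixed number of scrolls either wastes time on short pages or stops too early on long ones. As a rough sketch (not part of the original answer), the loop can instead stop once the page height no longer grows. The helper name `scroll_to_end` and its `pause` and `max_rounds` parameters are made up for illustration; `driver` is assumed to be a Selenium WebDriver, of which only `execute_script()` and `page_source` are used.

```python
import time


def scroll_to_end(driver, pause=2.0, max_rounds=100):
    """Scroll until the page height stops growing, then return the page source.

    `driver` is assumed to be a Selenium WebDriver; only execute_script()
    and page_source are used. `max_rounds` is a safety cap for pages that
    never run out of content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the bottom to trigger loading of the next batch
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the site time to append new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing was appended, so assume the end was reached
        last_height = new_height
    return driver.page_source
```

In the test above, `html_source = scroll_to_end(driver)` could stand in for the fixed `for` loop. A slow site may need a larger `pause`, since an unchanged height after a short wait can also mean the content simply has not arrived yet.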

Answer 2

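Most sites that have infinite scrolling do (as Lattyware notes) have a proper API as well, and you will likely be better served by using that rather than scraping.

But if you must scrape...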

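Such sites use JavaScript to request additional content from the server when you reach the bottom of the page. All you need to do is figure out the URL of that additional content, and then you can retrieve it yourself. You can work out the required URL by inspecting the script, by using the Firefox Web Console, or by using a debug proxy.

For example, open the Firefox Web Console, turn off all the filter buttons except Net, and load the site you want to scrape. You will see all the files as they are loaded. Scroll the page while watching the Web Console and you will see the URLs being used for the additional requests. Then you can request that URL yourself, see what format the data is in (probably JSON), and get it into your Python script.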
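Once the URL is known, fetching the pages directly might look like the sketch below. Everything site-specific here is an assumption made for illustration: the `cursor` query parameter, the `items` and `next_cursor` keys in the response, and the helper names `fetch_json` and `iter_items`. The real endpoint you see in the Web Console will have its own parameters and response format.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(url):
    """Download one page of results and decode it as JSON."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_items(base_url, fetch=fetch_json, max_pages=50):
    """Yield items page by page until the endpoint returns no further cursor.

    Assumed (hypothetical) response shape:
        {"items": [...], "next_cursor": "abc" or null}
    """
    cursor = None
    for _ in range(max_pages):
        # The first request carries no cursor; later ones pass the last one seen
        params = {"cursor": cursor} if cursor else {}
        url = base_url + ("?" + urllib.parse.urlencode(params) if params else "")
        page = fetch(url)
        yield from page.get("items", [])
        cursor = page.get("next_cursor")
        if not cursor:
            break  # no cursor means there are no more pages
```

Passing `fetch` as an argument keeps the paging logic separate from the network call, so a session with cookies or custom headers can be swapped in if the site requires them.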

Answer 3

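Finding the URL of the AJAX source would be the best option, but it can be cumbersome for some sites. Alternatively, you could use a headless browser such as {{1}} from QWebKit and send keyboard events while reading the data from the DOM tree. PyQt has a nice and simple API.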

Scraping websites with infinite scrolling
