Question

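Context

I am trying to scrape a page on the Google Play website.
When I open that page in a browser and scroll down with the browser's scrollbar, new apps/items are loaded. This is almost certainly done through an AJAX call.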

Problem:

I don't know how to use Scrapy to get the apps that only appear when I scroll down in the browser.

What I have tried:

I scraped that page and printed the response:

[screenshot of the printed response]

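As you can see, the response contains a loading indicator. It never shows up in a browser, because the browser makes the AJAX call automatically.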

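Note:

I know we can make XHR (AJAX) calls from Scrapy, but I want my spider to keep scraping that page until there are no more apps, so the spider should detect the AJAX call automatically (if there is one).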
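For comparison, the manual XHR approach mentioned above can be written so that it stops on its own once the server returns no more items. The following is only a sketch: `fetch_page` is a hypothetical callable standing in for whatever request returns one batch of apps, and its `start`/`count` parameters are assumptions for illustration, not Google Play's actual API.

```python
def iter_all_items(fetch_page, page_size=20, max_pages=1000):
    """Yield items batch by batch until the source runs dry.

    fetch_page(start, count) is a hypothetical callable that returns a
    list of items for one AJAX batch; an empty list means "no more".
    """
    start = 0
    for _ in range(max_pages):  # hard cap so a misbehaving server cannot loop forever
        batch = fetch_page(start, page_size)
        if not batch:
            break  # empty batch: nothing left to load
        for item in batch:
            yield item
        start += len(batch)


if __name__ == "__main__":
    # Stand-in data source: 45 fake app names served in slices.
    apps = ["app-%d" % i for i in range(45)]

    def fake_fetch(start, count):
        return apps[start:start + count]

    print(len(list(iter_all_items(fake_fetch))))
```

In a spider the same stop condition would live in the callback: yield the request for the next batch only if the current response actually contained items.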

I am using Python 2.7.9 and Scrapy 0.26 on Windows 7 64-bit.

Note 2:

I have already checked this thread.

Many thanks.

Answer 1

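This is a basic approach (not very Pythonic) that uses Selenium WebDriver to show one possible solution to your problem.

The basic idea is:

Create a browser instance (webdriver.Firefox()); as written this opens a visible Firefox window rather than a truly headless one
Load the page in it (self.driver.get(response.url))
While the footer element (class 'copyright') is not displayed, keep moving the focus within the page to it

This way the page keeps loading more elements.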

import time

import scrapy
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By


# A plain Spider is used here: CrawlSpider reserves parse() for its own rule handling.
class GooglePlaySpider(scrapy.Spider):
    name = "googleplay"
    allowed_domains = ["play.google.com"]
    start_urls = ["https://play.google.com"]

    def __init__(self, *args, **kwargs):
        # Let Scrapy finish its own spider setup before starting the browser.
        super(GooglePlaySpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Load the same URL in the real browser so that its JavaScript runs.
        self.driver.get(response.url)
        copyright = self.driver.find_element(By.CLASS_NAME, 'copyright')
        ActionChains(self.driver).move_to_element(copyright).perform()

        # Keep moving to the footer until it is actually displayed.
        while not copyright.is_displayed():
            copyright = self.driver.find_element(By.CLASS_NAME, 'copyright')
            time.sleep(3)  # give the page content time to load
            ActionChains(self.driver).move_to_element(copyright).perform()

        # scrape from here

When the loop ends, you can be sure the whole page has been loaded, and you can then run your own code to scrape the content.
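A variation on the loop above, shown here only as a sketch: instead of watching a footer element, scroll to the bottom repeatedly and stop once the page height no longer grows. It relies on WebDriver's `execute_script`; the function name `scroll_until_stable`, the pause length, and the round limit are assumptions chosen for illustration.

```python
import time


def scroll_until_stable(driver, pause=3.0, max_rounds=50):
    """Scroll down until the document height stops increasing.

    Returns the number of scrolls that caused new content to load.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_rounds):  # upper bound in case the page never settles
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the AJAX request time to finish
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was appended, so stop scrolling
        last_height = new_height
        loaded += 1
    return loaded
```

Inside `parse`, this would be called as `scroll_until_stable(self.driver)` right after `self.driver.get(response.url)`. A page that needs a button click to load more items would not be covered by this check.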

Documentation: http://selenium-python.readthedocs.org/en/latest/navigating.html

How to scrape the Google Play website using Scrapy
