Scraping dynamically generated data with Scrapy (and Selenium?)

Date: 2015-09-17 12:19:18

Tags: python selenium web-scraping scrapy

I'm struggling to get Scrapy (with or without Selenium) to extract dynamically generated content from a web page. The site lists the performance of different universities and lets you select each study area that a university offers. For example, from the page used in the code below, I want to be able to extract the university name ("Bond University") and the value for "overall quality of experience" (91.3%).

However, when I use 'view source', curl, or Scrapy, the actual values are not shown. For example, where I expect to see the university name, it shows:

<h1 class="inline-block instiution-name" data-bind="text: Description"></h1>

But if I inspect the element with Firebug or Chrome, it shows:

<h1 class="inline-block instiution-name" data-bind="text: Description">Bond University</h1>

Digging further, in Firebug's 'Net' tab I can see an AJAX(?) call being made that returns the relevant information, but I haven't been able to mimic it with Scrapy or even curl (and yes, I did search and spent a long time trying, I'm afraid).

Request headers:

POST /Websilk/DataServices/SurveyData.asmx/FetchInstitutionStudyAreaData HTTP/1.1
Host: www.qilt.edu.au
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:39.0) Gecko/20100101 Firefox/39.0
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/json; charset=utf-8
X-Requested-With: XMLHttpRequest
Referer: http://www.qilt.edu.au/institutions/institution/bond-university/business-management
Content-Length: 36
Cookie: _ga=GA1.3.69062787.1442441726; ASP.NET_SessionId=lueff4ysg3yvd2csv5ixsc1f; _gat=1
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache

POST parameters sent with the request:

{"InstitutionId":20,"StudyAreaId":0}

As a second option, I tried using Selenium with Scrapy, since I thought it might 'see' the real values the way a browser does, but to no avail. My main attempt so far is below:

import scrapy
import time  #used for the sleep() function

from selenium import webdriver

class QiltSpider(scrapy.Spider):
    name = "qilt"

    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()
        self.driver.get('http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/')
        time.sleep(5) # tried pausing, in case problem was delayed loading - didn't work

    def parse(self, response):
        # parse the response to find the uni name and show it in the console (using the XPath from Firebug). This finds the relevant section, but it shows as empty
        title = response.xpath('//*[@id="bd"]/div[2]/div/div/div[1]/div/div[2]/h1').extract()
        print title
        # dumping the whole response to a file so I can check whether dynamic values were captured
        with open("extract.html", 'wb') as f:
            f.write(response.body)
            self.driver.close()

Can anyone advise how I can achieve this?

Many thanks!

Edit: Thanks for the suggestions so far, but does anyone have ideas on how to specifically mimic the AJAX call with the InstitutionId and StudyAreaId parameters? My test code for this is below, but it still seems to get an error page.

import scrapy
from scrapy.http import FormRequest

class HeaderTestSpider(scrapy.Spider):
    name = "headerTest"

    allowed_domains = ["qilt.edu.au"]
    start_urls = [
        "http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/"
    ]

    def parse(self, response):
        return [FormRequest(url="http://www.qilt.edu.au/Websilk/DataServices/SurveyData.asmx/FetchInstitutionData",
                            method='POST',  
                            formdata={'InstitutionId':'20', 'StudyAreaId': '0'},
                            callback=self.parser2)]

1 Answer:

Answer 0 (score: 1):

The QILT page uses AJAX to retrieve its data from the server. That AJAX request is sent by JavaScript code triggered by the document.ready (jQuery) / window.onload (JavaScript) event (if you're not familiar with JavaScript: this event fires as soon as the browser has finished loading the page). Since you are fetching the page with software that does not execute JavaScript, this event is never triggered, so the AJAX call never happens and the placeholders stay empty.
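As an aside on the Selenium attempt above: the parse() callback there inspects Scrapy's own download (response), not the DOM that the Firefox driver has rendered, which is why the element still shows as empty. Reading the rendered HTML from the driver itself would look roughly like this (an untested sketch that reuses the URL from the question and an XPath based on the h1 class shown above):

import time

from selenium import webdriver

# Sketch only: read the rendered DOM from the Selenium driver itself,
# instead of parsing Scrapy's (JavaScript-free) response.
driver = webdriver.Firefox()
driver.get('http://www.qilt.edu.au/institutions/institution/rmit-university/architecture-building/')
time.sleep(5)  # crude wait for the AJAX call to finish; a WebDriverWait would be more robust

# The h1 carrying the institution name (class copied from the markup shown in the question)
title = driver.find_element_by_xpath('//h1[contains(@class, "instiution-name")]').text
print title  # should now contain the institution name rather than an empty string

driver.quit()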

The AJAX request you are trying to mimic has a request body of type application/json. Add the following header to your request:

Content-Type: application/json
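To make that concrete, here is a minimal sketch of how the request could be built in Scrapy, using the InstitutionId/StudyAreaId values captured above: the JSON body is sent as a raw string (rather than via FormRequest, which form-encodes its data) and the Content-Type header is set explicitly. The spider name and the comment about the "d" wrapper that ASP.NET web services usually add are assumptions, so inspect the real reply before relying on them:

import json

import scrapy


class QiltAjaxSpider(scrapy.Spider):
    name = "qilt_ajax"  # hypothetical spider name
    allowed_domains = ["qilt.edu.au"]

    def start_requests(self):
        payload = {"InstitutionId": 20, "StudyAreaId": 0}
        yield scrapy.Request(
            url="http://www.qilt.edu.au/Websilk/DataServices/SurveyData.asmx/FetchInstitutionStudyAreaData",
            method="POST",
            body=json.dumps(payload),  # raw JSON string, not form-encoded data
            headers={
                "Content-Type": "application/json; charset=utf-8",
                "X-Requested-With": "XMLHttpRequest",
                "Referer": "http://www.qilt.edu.au/institutions/institution/bond-university/business-management",
            },
            callback=self.parse_survey_data,
        )

    def parse_survey_data(self, response):
        # ASP.NET .asmx services usually wrap the payload in a "d" key; adjust once
        # you have seen the actual structure of the JSON reply.
        data = json.loads(response.body)
        self.logger.info("Raw AJAX reply: %s", data)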