Question

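I'm new to Scrapy. With the help of some tutorials I was able to scrape a few simple websites, but I'm now stuck on a new site where I have to fill in a search form and extract the results. The response I get back doesn't contain any results.

For example, take the following site: http://www.beaurepaires.com.au/store-locator/

I want to provide a list of postcodes and, for each postcode, extract information about the stores (store name and address).

I'm using the following code, but it doesn't work, and I don't know where to start.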

class BeaurepairesSpider(BaseSpider):
    name = "beaurepaires"
    allowed_domains = ["http://www.beaurepaires.com.au"]
    start_urls = ["http://www.beaurepaires.com.au/store-locator/"]
    #start_urls = ["http://www.beaurepaires.com.au/"]

    def parse(self, response):
        yield FormRequest.from_response(response, formname='frm_dealer_locator', formdata={'dealer_postcode_textfield':'2115'}, callback=self.parseBeaurepaires)

    def parseBeaurepaires(self, response):
        hxs = HtmlXPathSelector(response)
        filename = "postcodetest3.txt"
        open(filename, 'wb').write(response.body)
        table = hxs.select("//div[@id='jl_results']/table/tbody")
        headers = table.select("tr[position()<=1]")
        data_rows = table.select("tr[position()>1]")

Thanks!

Answer 1

This page loads its content largely through JavaScript, which makes it too complex for Scrapy on its own. Here is an example I came up with:

import re
from scrapy.http import FormRequest, Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class BeaurepairesSpider(BaseSpider):
    name = "beaurepaires"
    allowed_domains = ["beaurepaires.com.au", "gdt.rightthere.com.au"]
    start_urls = ["http://www.beaurepaires.com.au/store-locator/"]

    def parse(self, response):
        yield FormRequest.from_response(response, formname='frm_dealer_locator',
                                        formdata={'dealer_postcode_textfield':'2115'},
                                        callback=self.parseBeaurepaires)

    def parseBeaurepaires(self, response):
        hxs = HtmlXPathSelector(response)

        script = str(hxs.select("//div[@id='jl_container']/script[4]/text()").extract()[0])
        url, script_name = re.findall(r'LoadScripts\("([a-zA-Z:/\.]+)", "(\w+)"', script)[0]
        url = "%s/locator/js/data/%s.js" % (url, script_name)
        yield Request(url=url, callback=self.parse_js)

    def parse_js(self, response):
        print(response.body)  # here are your locations - right, inside the js file

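As you can see, this relies on regular expressions and hard-coded URLs, and you would still have to parse the JS file to get your locations. Even if you finish it and get the locations, the result is too fragile.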
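To make the fragility concrete, here is a minimal standalone sketch of the URL-extraction step from the spider above. The regular expression and the `/locator/js/data/` path are taken from that spider; the sample script strings, the `example.com` hosts, and the function name `build_data_url` are hypothetical and only serve as illustration.

```python
import re

# Same pattern as in the spider: it only accepts letters, ':', '/' and '.'
# in the base URL, so digits or hyphens in the host already break it.
LOAD_SCRIPTS_RE = re.compile(r'LoadScripts\("([a-zA-Z:/\.]+)", "(\w+)"')


def build_data_url(script_text):
    """Return the URL of the JS data file, or None if the pattern is not found."""
    match = LOAD_SCRIPTS_RE.search(script_text)
    if match is None:
        return None
    base_url, script_name = match.groups()
    return "%s/locator/js/data/%s.js" % (base_url, script_name)


if __name__ == "__main__":
    # Hypothetical inline script contents, not copied from the real page.
    working = 'LoadScripts("http://example.com/app", "dealers");'
    changed = 'LoadScripts("http://cdn-2.example.com", "dealers");'
    print(build_data_url(working))
    print(build_data_url(changed))  # the host no longer matches the pattern
```

Unlike the `[0]` indexing in the spider, this version returns `None` instead of raising `IndexError` when the page changes, but it cannot make the approach itself any less dependent on the page's inline JavaScript.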

Just switch to an in-browser tool such as Selenium (or combine Scrapy with it).
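A rough sketch of what the Selenium route could look like for a list of postcodes follows. The field name `dealer_postcode_textfield` and the `jl_results` container come from the question's code. Everything else is an assumption: that submitting the field's form triggers the search, that results render as table rows inside `jl_results`, that Firefox with its driver is installed, and the helper names `fetch_rows`, `split_header_and_rows`, and `scrape`.

```python
START_URL = "http://www.beaurepaires.com.au/store-locator/"
# Assumed location of the result rows, based on the question's XPath.
ROW_XPATH = "//div[@id='jl_results']//tr"


def split_header_and_rows(rows):
    """Treat the first table row as the header and the rest as data rows."""
    if not rows:
        return [], []
    return rows[0], rows[1:]


def fetch_rows(driver, postcode, timeout=10):
    """Submit one postcode and return the result table as lists of cell text."""
    # Imported lazily so the pure helper above works without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(START_URL)
    field = driver.find_element(By.NAME, "dealer_postcode_textfield")
    field.clear()
    field.send_keys(postcode)
    field.submit()  # assumption: a plain form submit starts the search
    # Raises TimeoutException if no rows appear within `timeout` seconds.
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.XPATH, ROW_XPATH))
    )
    rows = driver.find_elements(By.XPATH, ROW_XPATH)
    return [
        [cell.text for cell in row.find_elements(By.XPATH, "./td|./th")]
        for row in rows
    ]


def scrape(postcodes):
    """Return {postcode: (header, data_rows)} using one shared browser session."""
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        return {pc: split_header_and_rows(fetch_rows(driver, pc)) for pc in postcodes}
    finally:
        driver.quit()


if __name__ == "__main__":
    for postcode, (header, data_rows) in scrape(["2115"]).items():
        print(postcode, header, data_rows)
```

Because the browser executes the page's JavaScript itself, there is no need to locate and parse the JS data file by hand; the cost is a slower crawl and a dependency on a real browser.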

Extracting dynamic data with Scrapy: locations based on postcode
