I'm using Scrapy to crawl the office listing from http://www.johnlscott.com/agent-search.aspx.
The office listing addresses look like this: http://www.johnlscott.com/agent-search.aspx?p=agentResults.asp&OfficeID=8627 - but Scrapy fetches http://www.johnlscott.com/agent-search.aspx?OfficeID=8627&p=agentResults.asp, which is a dead page. The two query parameters after .aspx get swapped.
It still happens even when I explicitly load every address by hand into start_urls.
I'm using the latest Scrapy with Python 2.7 on Windows 8.1.
Code sample:
import csv

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
# JLSItem is the spider's Item subclass, defined in the project's items.py


class JLSSpider(CrawlSpider):
    name = 'JLS'
    allowed_domains = ['johnlscott.com']
    # start_urls = ['http://www.johnlscott.com/agent-search.aspx']

    rules = (
        Rule(callback="parse_start_url", follow=True),
    )

    def start_requests(self):
        # I have a csv of the office IDs (just letting it crawl through them creates the same issue)
        with open('hrefnums.csv', 'rbU') as ifile:
            read = csv.reader(ifile)
            for row in read:
                for col in row:
                    yield self.make_requests_from_url(
                        "http://www.johnlscott.com/agent-search.aspx?p=agentResults.asp&OfficeID=%s" % col)

    def parse_start_url(self, response):
        items = []
        sel = Selector(response)
        sections = sel.xpath("//tr/td/table[@id='tbAgents']/tr")
        for section in sections:
            item = JLSItem()
            item['name'] = section.xpath("td[2]/text()")[0].extract().replace(u'\xa0', ' ').strip()
            items.append(item)
        return items
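A quick way to confirm what Scrapy actually sends is to log both the scheduled URL and the one that answers, inside the callback. A minimal debugging sketch (the logging body below is my addition, not part of the original spider):

    def parse_start_url(self, response):
        # Debugging sketch: response.request.url is the URL that was actually
        # requested; response.url is the URL that answered (after any redirects).
        self.log("requested: %s" % response.request.url)
        self.log("answered:  %s" % response.url)
        return []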
Answer 0 (score: 0)
Crawling like this works without any problem:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request


class JLSSpider(CrawlSpider):
    name = 'JLS'
    allowed_domains = ['johnlscott.com']

    def start_requests(self):
        yield Request("http://www.johnlscott.com/agent-search.aspx?p=agentResults.asp&OfficeID=8627",
                      callback=self.parse_item)

    def parse_item(self, response):
        print response.body
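If the office IDs should still come from the CSV file used in the question, the same Request-based approach can be combined with it. A sketch under that assumption (hrefnums.csv holding one office ID per cell, as in the original code):

import csv

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request


class JLSSpider(CrawlSpider):
    name = 'JLS'
    allowed_domains = ['johnlscott.com']

    def start_requests(self):
        # Build each office URL by hand and yield a Request directly,
        # instead of going through make_requests_from_url.
        with open('hrefnums.csv', 'rbU') as ifile:
            for row in csv.reader(ifile):
                for office_id in row:
                    url = ("http://www.johnlscott.com/agent-search.aspx"
                           "?p=agentResults.asp&OfficeID=%s" % office_id)
                    yield Request(url, callback=self.parse_item)

    def parse_item(self, response):
        print response.body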