Scrapy spider does not scrape page 1

Date: 2015-12-14 14:44:03

Tags: scrapy scrapy-spider

I want my spider to scrape the listings on every page of the site, so I used CrawlSpider with a LinkExtractor. But when I look at the csv file, nothing from the first page (i.e. the start URL) was scraped; the scraped items only start from page 2. I tested my crawler in the Scrapy shell and it looked fine. I can't figure out what the problem is. Below is my spider code. Please help. Thanks a lot!

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from shputuo.items_shputuo import ShputuoItem


class Shputuo(CrawlSpider):
    name = "shputuo"

    allowed_domains = ["shpt.gov.cn"] # DO NOT use www in allowed domains
    start_urls =  ["http://www.shpt.gov.cn/gb/n6132/n6134/n6156/n7110/n7120/index.html"] 

    rules = (
        Rule(LinkExtractor(restrict_xpaths=("//div[@class = 'page']/ul/li[5]/a",)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        for sel in response.xpath("//div[@class = 'neirong']/ul/li"):
            item = ShputuoItem()
            word = sel.xpath("a/text()").extract()[0]
            item['id'] = word[3:11]
            item['title'] = word[11:len(word)]
            item['link'] = "http://www.shpt.gov.cn" + sel.xpath("a/@href").extract()[0]
            item['time2'] = sel.xpath("span/text()").extract()[0][1:11]

            request = scrapy.Request(item['link'], callback = self.parse_content)
            request.meta['item'] = item            

            yield request

    def parse_content(self, response):
        item = response.meta['item']
        item['question'] = response.xpath("//div[@id = 'ivs_content']/p[2]/text()").extract()[0]
        item['question'] = "".join(map(unicode.strip, item['question'])) # get rid of unwanted spaces and other whitespace
        item['reply'] =  response.xpath("//div[@id = 'ivs_content']/p[3]/text()").extract()[0]
        item['reply'] = "".join(map(unicode.strip, item['reply']))
        item['agency'] = item['reply'][6:10]
        item['time1'] = "2015-" + item['question'][0] + "-" + item['question'][2]


        yield item

1 Answer:

Answer 0 (score: 1):

It looks like what you really need is to also parse the elements of the start_urls requests instead of only following the rules.

Use the parse_start_url method for that; it is the default callback for the start_urls requests.
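A pure-Python sketch of why this works, without requiring a Scrapy install: `MiniCrawlSpider`, `crawl`, and the page data below are made-up stand-ins that mimic CrawlSpider's dispatch, not Scrapy APIs. The point is that the start URL's response goes to `parse_start_url` (which returns nothing by default), while only rule-matched links reach the `Rule` callback.

```python
# Toy stand-in for CrawlSpider's dispatch logic (names and data are
# invented for illustration; this is not the Scrapy API).

PAGES = {
    "page1": ["item A", "item B"],   # listings on the start URL
    "page2": ["item C"],             # a page reached via a Rule link
}

class MiniCrawlSpider:
    """Mimics CrawlSpider: the start URL's response is handed to
    parse_start_url (empty by default); only links matched by a Rule
    are handed to the Rule callback (parse_items here)."""

    def parse_start_url(self, response):
        return []                    # default: start URL yields no items

    def parse_items(self, response):
        return list(PAGES[response])

    def crawl(self):
        items = []
        items.extend(self.parse_start_url("page1"))  # start URL
        items.extend(self.parse_items("page2"))      # rule-followed link
        return items

class BrokenSpider(MiniCrawlSpider):
    pass                             # like the question's spider: page 1 lost

class FixedSpider(MiniCrawlSpider):
    def parse_start_url(self, response):
        # The fix: reuse the Rule callback for the start URL too.
        return self.parse_items(response)

print(BrokenSpider().crawl())   # ['item C'] -- page 1 items missing
print(FixedSpider().crawl())    # ['item A', 'item B', 'item C']
```

In the actual spider, adding `def parse_start_url(self, response): return self.parse_items(response)` to the class has the same effect: page 1 is parsed by the same callback as pages 2 and onward.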