Scrapy works in the shell but the spider returns an empty csv

Date: 2017-10-30 14:16:05

Tags: shell csv scrapy

I'm learning Scrapy. Right now I'm just trying to scrape some items, but when I run the spider:

planefinder]# scrapy crawl planefinder -o /User/spider/planefinder/pf.csv -t csv

it prints the usual log output but scrapes nothing ("Crawled 0 pages ..." etc.) and returns an empty csv file. The strange part is that when I test the xpath in the scrapy shell, it works:

>>> from scrapy.selector import Selector
>>> sel = Selector(response)
>>> flights = sel.xpath("//div[@class='col-md-12'][1]/div/div/table//tr")
>>> items = []
>>> for flt in flights:
...     item = flt.xpath("td[1]/a/@href").extract_first()
...     items.append(item)
... 
>>> items

Here is my planeFinder.py code:

# -*- coding: utf-8 -*-

from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector, HtmlXPathSelector
from planefinder.items import arr_flt_Item, dep_flt_Item


class planefinder(CrawlSpider):
    name = 'planefinder'
    host = 'https://planefinder.net'
    start_url = ['https://planefinder.net/data/airport/PEK/']


    def parse(self, response):
        arr_flights = response.xpath("//div[@class='col-md-12'][1]/div/div/table//tr")
        dep_flights = response.xpath("//div[@class='col-md-12'][2]/div/div/table//tr")   

        for flight in arr_flights:
            arr_item = arr_flt_Item()

            arr_flt_url = flight.xpath('td[1]/a/@href').extract_first()
            arr_item['arr_flt_No'] = flight.xpath('td[1]/a/text()').extract_first()
            arr_item['STA'] = flight.xpath('td[2]/text()').extract_first()
            arr_item['From'] = flight.xpath('td[3]/a/text()').extract_first()
            arr_item['ETA'] = flight.xpath('td[4]/text()').extract_first()

            yield arr_item

3 Answers:

Answer 0 (score: 0)

The problem here is not properly understanding which kind of "spider" to use, since Scrapy provides several of them for different purposes.

The main one, and the one you should be using here, is the plain Spider rather than the CrawlSpider; the CrawlSpider is meant for crawling deeper and deeper into sites such as forums, blogs, etc.

Just change the spider type to:

from scrapy import Spider

class planefinder(Spider):
    ...
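
To make that concrete (this sketch is mine, not part of the original answer): a plain Spider uses parse() as the default callback for every URL in start_urls, so a minimal version of the question's spider could look like the following. It assumes the arr_flt_Item class from planefinder.items exists exactly as in the question:

# -*- coding: utf-8 -*-

from scrapy import Spider
from planefinder.items import arr_flt_Item  # defined in the question's project


class PlanefinderSpider(Spider):
    name = 'planefinder'
    allowed_domains = ['planefinder.net']   # note: allowed_domains, not host
    start_urls = ['https://planefinder.net/data/airport/PEK/']  # note: start_urls, not start_url

    def parse(self, response):
        # parse() is called automatically for each response from start_urls.
        for flight in response.xpath("//div[@class='col-md-12'][1]/div/div/table//tr"):
            item = arr_flt_Item()
            item['arr_flt_No'] = flight.xpath('td[1]/a/text()').extract_first()
            item['STA'] = flight.xpath('td[2]/text()').extract_first()
            item['From'] = flight.xpath('td[3]/a/text()').extract_first()
            item['ETA'] = flight.xpath('td[4]/text()').extract_first()
            yield item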

Answer 1 (score: 0)

Check the value of ROBOTSTXT_OBEY in your settings.py file. By default it is set to True (but not when you run the shell). Set it to False if you don't mind ignoring the site's robots.txt rules.
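
In a project created with scrapy startproject, that is a one-line change in settings.py (sketch):

# settings.py -- stop filtering requests through robots.txt
ROBOTSTXT_OBEY = False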

Answer 2 (score: 0)

Please check the documentation for Spider before moving on to CrawlSpider. Some of the problems I found are:

  • Use allowed_domains instead of host
  • Use start_urls instead of start_url
  • It looks like the page needs some cookies to be set, or it may be using some kind of basic anti-bot protection, so you may need to land somewhere else on the site first (see the sketch after this list).
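
If cookies or headers really are the blocker, one possible work-around (a sketch only; the header and cookie values below are placeholders, not taken from the site) is to attach them explicitly in start_requests():

from scrapy import Request, Spider


class PlanefinderCookieSpider(Spider):
    name = 'planefinder'
    allowed_domains = ['planefinder.net']

    def start_requests(self):
        # Placeholder header/cookie values; real ones, if needed at all, would
        # have to be copied from a browser session on planefinder.net.
        yield Request(
            'https://planefinder.net/data/airport/PEK/',
            headers={'User-Agent': 'Mozilla/5.0'},
            cookies={'session': 'placeholder'},
            callback=self.parse,
        )

    def parse(self, response):
        # Parse the arrivals/departures tables here, as in the other snippets.
        pass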

Try this (I also changed the items a bit):

# -*- coding: utf-8 -*-

from scrapy import Field, Item, Request
from scrapy.spiders import Spider

class ArrivalFlightItem(Item):
    arr_flt_no = Field()
    arr_sta = Field()
    arr_from = Field()
    arr_eta = Field()


class PlaneFinder(Spider):
    name = 'planefinder'
    allowed_domains = ['planefinder.net']
    start_urls = ['https://planefinder.net/data/airports']

    def parse(self, response):
        yield Request('https://planefinder.net/data/airport/PEK', callback=self.parse_flight)


    def parse_flight(self, response):
        flights_xpath = ('//*[contains(@class, "departure-board") and '
                         './preceding-sibling::h2[contains(., "Arrivals")]]'
                         '//tr[not(./th) and not(./td[@class="spacer"])]')

        for flight in response.xpath(flights_xpath):
            arrival = ArrivalFlightItem()
            arr_flt_url = flight.xpath('td[1]/a/@href').extract_first()
            arrival['arr_flt_no'] = flight.xpath('td[1]/a/text()').extract_first()
            arrival['arr_sta'] = flight.xpath('td[2]/text()').extract_first()
            arrival['arr_from'] = flight.xpath('td[3]/a/text()').extract_first()
            arrival['arr_eta'] = flight.xpath('td[4]/text()').extract_first()

            yield arrival
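
With one of these fixes applied, the export command from the question should then produce a non-empty file, e.g. scrapy crawl planefinder -o pf.csv (the csv format is inferred from the file extension, so -t csv is optional).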