I am learning Scrapy. Right now I am just trying to scrape items, but when I run the spider:

planefinder]# scrapy crawl planefinder -o /User/spider/planefinder/pf.csv -t csv

it only shows technical information and no scraped content (Crawled 0 pages ... etc.), and it returns an empty csv file. The problem is that when I test the xpath in scrapy shell, it works:
>>> from scrapy.selector import Selector
>>> sel = Selector(response)
>>> flights = sel.xpath("//div[@class='col-md-12'][1]/div/div/table//tr")
>>> items = []
>>> for flt in flights:
... item = flt.xpath("td[1]/a/@href").extract_first()
... items.append(item)
...
>>> items
Here is my planeFinder.py code:
# -*-:coding:utf-8 -*-
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector, HtmlXPathSelector
from planefinder.items import arr_flt_Item, dep_flt_Item

class planefinder(CrawlSpider):
    name = 'planefinder'
    host = 'https://planefinder.net'
    start_url = ['https://planefinder.net/data/airport/PEK/']

    def parse(self, response):
        arr_flights = response.xpath("//div[@class='col-md-12'][1]/div/div/table//tr")
        dep_flights = response.xpath("//div[@class='col-md-12'][2]/div/div/table//tr")
        for flight in arr_flights:
            arr_item = arr_flt_Item()
            arr_flt_url = flight.xpath('td[1]/a/@href').extract_first()
            arr_item['arr_flt_No'] = flight.xpath('td[1]/a/text()').extract_first()
            arr_item['STA'] = flight.xpath('td[2]/text()').extract_first()
            arr_item['From'] = flight.xpath('td[3]/a/text()').extract_first()
            arr_item['ETA'] = flight.xpath('td[4]/text()').extract_first()
            yield arr_item
Answer 0 (score: 0)
The problem here is not properly understanding which "spider" to use, since Scrapy offers different custom ones. The main one, and the one you should be using here, is the simple Spider rather than CrawlSpider, because CrawlSpider is meant for deeper, broader crawls of forums, blogs, etc.

Just change the spider type to:

from scrapy import Spider

class planefinder(Spider):
    ...
Answer 1 (score: 0)
Check the value of ROBOTSTXT_OBEY in your settings.py file. By default it is set to True (but not when you run the shell). Set it to False if you don't mind ignoring the site's robots.txt.
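A minimal sketch of the relevant line in settings.py (the rest of the file is omitted; whether planefinder.net's robots.txt actually blocks these pages is an assumption you should verify in the crawl log):

```python
# settings.py (excerpt)
# With ROBOTSTXT_OBEY = True (the default in generated projects), Scrapy
# fetches /robots.txt first and silently skips any disallowed URLs, which
# can show up as "Crawled 0 pages" with an empty output file.
ROBOTSTXT_OBEY = False
```

When a request is filtered for this reason, the log contains a line like "Forbidden by robots.txt", so checking the crawl output confirms whether this setting is the culprit.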
Answer 2 (score: 0)
Please check the documentation for Spider before moving on to CrawlSpider. Some of the problems I found are:

- instead of host, use allowed_domains
- instead of start_url, use start_urls

Try this (I also changed it a bit):
# -*- coding: utf-8 -*-
from scrapy import Field, Item, Request
from scrapy.spiders import CrawlSpider, Spider

class ArrivalFlightItem(Item):
    arr_flt_no = Field()
    arr_sta = Field()
    arr_from = Field()
    arr_eta = Field()

class PlaneFinder(Spider):
    name = 'planefinder'
    allowed_domains = ['planefinder.net']
    start_urls = ['https://planefinder.net/data/airports']

    def parse(self, response):
        yield Request('https://planefinder.net/data/airport/PEK', callback=self.parse_flight)

    def parse_flight(self, response):
        flights_xpath = ('//*[contains(@class, "departure-board") and '
                         './preceding-sibling::h2[contains(., "Arrivals")]]'
                         '//tr[not(./th) and not(./td[@class="spacer"])]')
        for flight in response.xpath(flights_xpath):
            arrival = ArrivalFlightItem()
            arr_flt_url = flight.xpath('td[1]/a/@href').extract_first()
            arrival['arr_flt_no'] = flight.xpath('td[1]/a/text()').extract_first()
            arrival['arr_sta'] = flight.xpath('td[2]/text()').extract_first()
            arrival['arr_from'] = flight.xpath('td[3]/a/text()').extract_first()
            arrival['arr_eta'] = flight.xpath('td[4]/text()').extract_first()
            yield arrival