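I have a small problem with portia / scrapy; maybe someone knows what is going wrong. I am using portia 16.02 in a Vagrant environment (on Windows 10). I created a spider for a small private project. The spider checks the results page (used as the start page) for the flight route between OSL (Oslo) and LAX (Los Angeles) on specific dates at expedia.de (JS-heavy ...). I annotated the site and the sample looked great (I added another start page, and that sample looked fine too). So I ran a portiacrawl inside the VM to check the export: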
/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py:115: ScrapyDeprecationWarning: SPIDER_MANAGER_CLASS option is deprecated. Please use SPIDER_LOADER_CLASS.
self.spider_loader = _get_spider_loader(settings)
2016-02-26 14:12:10 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2016-02-26 14:12:10 [scrapy] INFO: Optional features available: ssl, http11
2016-02-26 14:12:10 [scrapy] INFO: Overridden settings: {'FEED_FORMAT': 'xml', 'FEED_URI': 'test.xml', 'DUPEFILTER_CLASS': 'scrapyjs.SplashAwareDupeFilter'}
2016-02-26 14:12:10 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-02-26 14:12:13 [py.warnings] WARNING: /usr/local/lib/python2.7/dist-packages/scrapyjs/middleware.py:8: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
from scrapy import log
2016-02-26 14:12:13 [py.warnings] WARNING: /usr/local/lib/python2.7/dist-packages/scrapyjs/dupefilter.py:8: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
from scrapy.dupefilter import RFPDupeFilter
2016-02-26 14:12:13 [py.warnings] WARNING: /usr/local/lib/python2.7/dist-packages/scrapyjs/cache.py:11: ScrapyDeprecationWarning: Module `scrapy.contrib.httpcache` is deprecated, use `scrapy.extensions.httpcache` instead
from scrapy.contrib.httpcache import FilesystemCacheStorage
2016-02-26 14:12:13 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, PageActionsMiddleware, CookiesMiddleware, SlybotJsMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-26 14:12:13 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-26 14:12:13 [scrapy] INFO: Enabled item pipelines: DupeFilterPipeline
2016-02-26 14:12:13 [scrapy] INFO: Spider opened
2016-02-26 14:12:13 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-26 14:12:13 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-26 14:12:14 [scrapy] DEBUG: Redirecting (301) to <GET https://www.expedia.de/Flight-SearchResults?inpPackageType=FLIGHT_ONLY&inpInfants=2&inpFlightClass=2&inpDepartureDates=17.03.2016&inpDepartureDates=24.03.2016&inpDepartureTimes=362&inpDepartureTimes=362&inpFlightRouteType=3&inpHotelRoomCount=1&inpFlightAirlinePreference=&inpAdultCounts=1&inpIsNonstopOnly=N&intcp=0&inpChildCounts=0&action=FlightSearchResults%40searchFlights&inpSeniorCounts=0&inpRefundableFlightsOnly=N&inpSortType=0&inpDepartureLocations=Oslo%2C+Norwegen+(OSL-Alle+Flugh%C3%A4fen)&inpDepartureLoc&inttkn=NfXCUxEImb3egksh> from <GET https://goo.gl/6u5fEm>
2016-02-26 14:12:14 [scrapy] DEBUG: Crawled (200) <GET https://www.expedia.de/Flight-SearchResults?inpPackageType=FLIGHT_ONLY&inpInfants=2&inpFlightClass=2&inpDepartureDates=17.03.2016&inpDepartureDates=24.03.2016&inpDepartureTimes=362&inpDepartureTimes=362&inpFlightRouteType=3&inpHotelRoomCount=1&inpFlightAirlinePreference=&inpAdultCounts=1&inpIsNonstopOnly=N&intcp=0&inpChildCounts=0&action=FlightSearchResults%40searchFlights&inpSeniorCounts=0&inpRefundableFlightsOnly=N&inpSortType=0&inpDepartureLocations=Oslo%2C+Norwegen+(OSL-Alle+Flugh%C3%A4fen)&inpDepartureLoc&inttkn=NfXCUxEImb3egksh> (referer: None)
2016-02-26 14:12:15 [scrapy] DEBUG: Crawled (200) <GET https://www.expedia.de/Flight-SearchResults?inpPackageType=FLIGHT_ONLY&inpInfants=2&inpFlightClass=2&inpDepartureDates=17.03.2016&inpDepartureDates=24.03.2016&inpDepartureTimes=362&inpDepartureTimes=362&inpFlightRouteType=3&inpHotelRoomCount=1&inpFlightAirlinePreference=&inpAdultCounts=1&inpIsNonstopOnly=N&intcp=0&inpChildCounts=0&action=FlightSearchResults%40searchFlights&inpSeniorCounts=0&inpRefundableFlightsOnly=N&inpSortType=0&inpDepartureLocations=Oslo%2C+Norwegen+%28OSL-Alle+Flugh%C3%A4fen%29&inpDepartureLocations=Los+Angeles%2C+CA%2C+USA+%28LAX-Los+Angeles+Intl.%29&inpArrivalLocations=Los+Angeles%2C+CA%2C+USA+%28LAX-Los+Angeles+Intl.%29&inpArrivalLocations=Oslo%2C+Norwegen+%28OSL-Alle+Flugh%C3%A4fen%29&inttkn=mrRyD4CXEnw3zhjf> (referer: None)
2016-02-26 14:12:15 [scrapy] INFO: Closing spider (finished)
2016-02-26 14:12:15 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1869,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 118471,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 2, 26, 14, 12, 15, 963588),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'log_count/WARNING': 3,
'response_received_count': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2016, 2, 26, 14, 12, 13, 657551)}
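There are a few warnings, which is fine. Afterwards I checked the export file, but it was empty. I tested again with explicitly named XML files as the output file, but those files were empty as well. Does anyone know why I am getting no export?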
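One thing I notice in the stats dump above: there is no `item_scraped_count` entry, and as far as I know Scrapy only records that key once at least one item has been scraped. That would suggest the spider extracted nothing from the two downloaded pages, rather than the XML export itself failing. A minimal sketch of that check, assuming my reading of the stats is right (the `items_scraped` helper is a hypothetical name of mine, and the dictionary is only a subset of the values copied from the log):

```python
# Subset of the stats printed at the end of the crawl log above.
stats = {
    'downloader/request_count': 3,
    'downloader/response_count': 3,
    'downloader/response_status_count/200': 2,
    'downloader/response_status_count/301': 1,
    'finish_reason': 'finished',
    'response_received_count': 2,
    'scheduler/enqueued': 3,
    'scheduler/dequeued': 3,
}


def items_scraped(stats):
    """Return the number of scraped items recorded in a Scrapy stats dict.

    Hypothetical helper: a missing 'item_scraped_count' key is treated as
    zero items, on the assumption that Scrapy only adds the key after the
    first item has been scraped.
    """
    return stats.get('item_scraped_count', 0)


print(items_scraped(stats))
```

If this prints 0, the empty export file would simply reflect that no items reached the feed exporter.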
Thanks!
Best regards, Timo