Scrapy - submitting a form using results scraped from another website

Date: 2015-10-25 08:39:13

Tags: python scrapy

I'm trying to scrape themoviedb.org for the movies listed on another TV-guide website. The idea behind it is to fetch movie information (rating, release date, ...) for the films that will be shown over the next few days.

So I scrape the movie titles from the first website and then want to fetch the extra information through the search form on themoviedb.org.

    import urllib  # Python 2 stdlib; quote_plus lives in urllib.parse on Python 3

    import scrapy
    import unidecode

    # TVGuideItem is assumed to be defined in the project's items module.

    def parse(self, response):
        for col_inner in response.xpath('//div[@class="grid__col__inner"]'):
            chnl = col_inner.xpath('.//div[@class="tv-guide__channel"]/h6/a/text()').extract()
            for program in col_inner.xpath('.//div[@class="program"]'):
                item = TVGuideItem()
                item['channel'] = chnl
                title = program.xpath('.//div[@class="title"]/a/text()').extract()
                title = unidecode.unidecode(title[0])  # Replace accented characters with plain ASCII equivalents
                title = urllib.quote_plus(title)  # Encode the title as a valid URL parameter
                item['title'] = list(title)
                item['start_ts'] = program.xpath('.//div[@class="time"]/text()').extract()

                # Extract information from the Movie Database www.themoviedb.org
                request = scrapy.FormRequest(url="https://www.themoviedb.org/",
                                             formdata={'query': title},
                                             callback=self.parse_tmdb)
                request.meta['item'] = item  # Pass the item along with the request to the detail page

                yield request

    def parse_tmdb(self, response):
        item = response.meta['item']  # Use the passed item
        film = response.xpath('//div[@class="item poster card"][1]')
        item['genre'] = film.xpath('.//span[@class="genres"]/text()').extract()
        item['plot'] = film.xpath('.//p[@class="overview"]/text()').extract()
        item['rating'] = film.xpath('.//span[@class="vote_average"]/text()').extract()
        item['release_date'] = film.xpath('.//span[@class="release_date"]/text()').extract()

        return item

But I get the error Ignoring response <404 https://www.themoviedb.org/>: HTTP status code is not handled or not allowed, and I don't know how to solve this.

Edit: I changed the code to use unidecode to strip special characters from the title and to replace spaces with plus signs. I'm now also passing a plain string in the FormRequest's formdata. But I keep getting the same DEBUG warning, and nothing is scraped from themoviedb.org.
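For reference, one way to stop guessing the form's target URL is scrapy.FormRequest.from_response, which copies the form's action URL and any hidden fields from an actual page. A minimal sketch, assuming the TMDb homepage really carries a search form with a 'query' input (the two-step flow and the fill_search_form name are not part of the original code):

    # Hypothetical sketch: fetch the homepage once per title, then let Scrapy
    # submit its real search form instead of POSTing to a hard-coded URL.
    def parse(self, response):
        ...
        yield scrapy.Request("https://www.themoviedb.org/",
                             callback=self.fill_search_form,
                             meta={'item': item, 'title': title},
                             dont_filter=True)  # same URL for every title, so bypass the dupe filter

    def fill_search_form(self, response):
        # from_response picks the first <form> by default; pass formname
        # if the page contains several forms.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'query': response.meta['title']},
            callback=self.parse_tmdb,
            meta={'item': response.meta['item']})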

I also added the log:

2015-10-27 21:09:39 [scrapy] INFO: Scrapy 1.0.3 started (bot: topfilms)
2015-10-27 21:09:39 [scrapy] INFO: Optional features available: ssl, http11
2015-10-27 21:09:39 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'topfilms.spiders', 'SPIDER_MODULES': ['topfilms.spiders'], 'BOT_NAME': 'topfilms'}
2015-10-27 21:09:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-27 21:09:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-27 21:09:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-27 21:09:39 [scrapy] INFO: Enabled item pipelines: StoreInDBPipeline
2015-10-27 21:09:39 [scrapy] INFO: Spider opened
2015-10-27 21:09:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-27 21:09:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-27 21:09:39 [scrapy] DEBUG: Redirecting (301) to <GET http://www.nieuwsblad.be/tv-gids/vandaag/film> from <GET http://www.nieuwsblad.be/tv-gids/vandaag/film/>
2015-10-27 21:09:39 [scrapy] DEBUG: Crawled (200) <GET http://www.nieuwsblad.be/tv-gids/vandaag/film> (referer: None)
2015-10-27 21:09:40 [scrapy] DEBUG: Filtered duplicate request: <POST https://www.themoviedb.org/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2015-10-27 21:09:40 [scrapy] DEBUG: Crawled (404) <POST https://www.themoviedb.org/> (referer: http://www.nieuwsblad.be/tv-gids/vandaag/film)
2015-10-27 21:09:40 [scrapy] DEBUG: Crawled (404) <POST https://www.themoviedb.org/> (referer: http://www.nieuwsblad.be/tv-gids/vandaag/film)

<... SAME MESSAGE REPEATED MULTIPLE TIMES ...>

2015-10-27 21:09:41 [scrapy] DEBUG: Ignoring response <404 https://www.themoviedb.org/>: HTTP status code is not handled or not allowed
2015-10-27 21:09:41 [scrapy] INFO: Closing spider (finished)
2015-10-27 21:09:41 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 34287,
 'downloader/request_count': 79,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 77,
 'downloader/response_bytes': 427844,
 'downloader/response_count': 79,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'downloader/response_status_count/404': 77,
 'dupefilter/filtered': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 27, 20, 9, 41, 943855),
 'log_count/DEBUG': 158,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 78,
 'scheduler/dequeued': 79,
 'scheduler/dequeued/memory': 79,
 'scheduler/enqueued': 79,
 'scheduler/enqueued/memory': 79,
 'start_time': datetime.datetime(2015, 10, 27, 20, 9, 39, 502545)}
2015-10-27 21:09:41 [scrapy] INFO: Spider closed (finished)

1 Answer:

Answer 0 (score: 1)

The first xpath selector returns a list, so what ends up in item['title'] is a list, and that is what you are passing in formdata={'query': item['title']}.
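A quick illustration using the names from the question's code (the example value is made up):

    title = program.xpath('.//div[@class="title"]/a/text()').extract()
    # extract() always returns a list, e.g. ['Pulp Fiction']
    title = title[0]  # take the first element to get a plain string
    # and note that list('Pulp+Fiction') splits the string back into single
    # characters -- ['P', 'u', 'l', 'p', '+', ...] -- not what the form expects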

Now, if you still need to process those unhandled statuses (say, to still be able to parse a response with a 404 status), you should use the errback argument in the request, like this:

    ...
        yield scrapy.FormRequest(url="https://www.themoviedb.org/",
                                 formdata={'query': item['title']},
                                 callback=self.parse_tmdb,
                                 errback=self.parse_error,
                                 meta={'item': item})

    def parse_error(self, failure):
        # An errback receives a twisted Failure, not a Response; for an
        # HttpError the response is available as failure.value.response.
        # do your magic
        ...
    ...
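Note that if the goal is only to have the 404 page delivered to the regular callback rather than to an errback, Scrapy's handle_httpstatus_list request meta key (or the HTTPERROR_ALLOWED_CODES setting) covers that too; a minimal sketch reusing the request above:

    # Ask HttpErrorMiddleware to pass 404 responses through to the callback
    yield scrapy.FormRequest(url="https://www.themoviedb.org/",
                             formdata={'query': item['title']},
                             callback=self.parse_tmdb,
                             meta={'item': item,
                                   'handle_httpstatus_list': [404]})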