I am trying to run the following spider:
import scrapy
from apkmirror.items import ApkmirrorItem


class ApkmirrorScraperSpider(scrapy.Spider):
    name = "apkmirror-scraper"
    allowed_domains = ["apkmirror.com"]
    custom_settings = {'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'}
    start_urls = ['https://www.apkmirror.com/apk/google-inc/youtube/youtube-12-19-56-release/youtube-12-19-56-android-apk-download/']

    def parse(self, response):
        item = ApkmirrorItem()
        # We assume that the 'actual' download page follows this naming convention.
        # (This could also be extracted using response.css('.downloadButton').xpath('.//@href')).
        download_page_url = response.urljoin("download/")
        request = scrapy.Request(download_page_url, callback=self.parse_download_page)
        request.meta['item'] = item
        yield request

    def parse_download_page(self, response):
        '''Get the alternative download link from the 'actual' download page.'''
        item = response.meta['item']
        download_relative_url = response.xpath('//*[contains(text(), "Your download will start immediately.")]/a/@href').extract_first()
        download_url = response.urljoin(download_relative_url)
        item['file_urls'] = [download_url]
        yield item
where items.py is
import scrapy


class ApkmirrorItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
and settings.py includes the following to activate the files pipeline:
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1
}
FILES_STORE = '/tmp/apkmirror_test/files'
However, because of a 302 redirect, I get a WARNING in the log:
2017-05-23 12:13:51 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: apkmirror)
2017-05-23 12:13:51 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'apkmirror', 'NEWSPIDER_MODULE': 'apkmirror.spiders', 'SPIDER_MODULES': ['apkmirror.spiders']}
2017-05-23 12:13:52 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2017-05-23 12:13:52 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-23 12:13:52 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-23 12:13:52 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2017-05-23 12:13:52 [scrapy.core.engine] INFO: Spider opened
2017-05-23 12:13:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-23 12:13:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-23 12:13:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.apkmirror.com/apk/google-inc/youtube/youtube-12-19-56-release/youtube-12-19-56-android-apk-download/> (referer: None)
2017-05-23 12:13:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.apkmirror.com/apk/google-inc/youtube/youtube-12-19-56-release/youtube-12-19-56-android-apk-download/download/> (referer: https://www.apkmirror.com/apk/google-inc/youtube/youtube-12-19-56-release/youtube-12-19-56-android-apk-download/)
2017-05-23 12:13:58 [scrapy.core.engine] DEBUG: Crawled (302) <GET https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041> (referer: None)
2017-05-23 12:13:58 [scrapy.pipelines.files] WARNING: File (code: 302): Error downloading file from <GET https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041> referred in <None>
2017-05-23 12:13:59 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.apkmirror.com/apk/google-inc/youtube/youtube-12-19-56-release/youtube-12-19-56-android-apk-download/download/>
{'file_urls': ['https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041'],
'files': []}
2017-05-23 12:13:59 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-23 12:13:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1336,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 62710,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 23, 12, 13, 59, 51739),
'item_scraped_count': 1,
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'log_count/WARNING': 1,
'memusage/max': 47157248,
'memusage/startup': 47157248,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2017, 5, 23, 12, 13, 52, 187141)}
2017-05-23 12:13:59 [scrapy.core.engine] INFO: Spider closed (finished)
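and the file is not downloaded.
There appears to be an issue about this (https://github.com/scrapy/scrapy/issues/2004) that should have been fixed in Scrapy 1.4.0. However, I am fairly sure I am running 1.4, and I still get this error. How can I fix it?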
Additional information: I found it helpful to use the command
scrapy shell https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041 -s USER_AGENT="Mozilla"
which produces the following log before the Scrapy shell starts:
2017-05-23 13:56:10 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-05-23 13:56:10 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'Mozilla'}
2017-05-23 13:56:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-05-23 13:56:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-23 13:56:10 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-23 13:56:10 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-05-23 13:56:10 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-23 13:56:10 [scrapy.core.engine] INFO: Spider opened
2017-05-23 13:56:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.apkmirror.com/wp-content/uploads/uploaded/591e9ab20113f/com.google.android.youtube_12.19.56-1219563340_minAPI21(armeabi-v7a)(480dpi)_apkmirror.com.apk> from <GET https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041>
2017-05-23 13:56:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.apkmirror.com/wp-content/uploads/uploaded/591e9ab20113f/com.google.android.youtube_12.19.56-1219563340_minAPI21(armeabi-v7a)(480dpi)_apkmirror.com.apk> (referer: None)
2017-05-23 13:56:17 [traitlets] DEBUG: Using default logger
2017-05-23 13:56:17 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f67f9424438>
[s] item {}
[s] request <GET https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041>
[s] response <200 https://www.apkmirror.com/wp-content/uploads/uploaded/591e9ab20113f/com.google.android.youtube_12.19.56-1219563340_minAPI21(armeabi-v7a)(480dpi)_apkmirror.com.apk>
[s] settings <scrapy.settings.Settings object at 0x7f67f0ae19b0>
[s] spider <DefaultSpider 'default' at 0x7f67f06ddbe0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:
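The log shows that the given download.php URL is redirected to https://www.apkmirror.com/wp-content/uploads/uploaded/591e9ab20113f/com.google.android.youtube_12.19.56-1219563340_minAPI21(armeabi-v7a)(480dpi)_apkmirror.com.apk, which is the actual file I want to download. Is it possible to have the URLs in file_urls redirected in a similar way?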
Answer 0 (score: 4)
According to the documentation (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#allowing-redirections), you have to set
MEDIA_ALLOW_REDIRECTS = True
in settings.py. That worked for me.
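For reference, here is a minimal sketch of how the relevant part of settings.py might look with that option added. ITEM_PIPELINES and FILES_STORE are copied from the question, and only the last line is new. Where the line goes within the file is an assumption and should not matter.

```python
# settings.py (sketch): pipeline configuration copied from the question
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/tmp/apkmirror_test/files'

# Allow the media pipelines to follow HTTP redirects, such as the 302
# returned by the download.php URL, instead of treating them as failures.
MEDIA_ALLOW_REDIRECTS = True
```

With this in place, the download.php request should be followed to the final .apk URL and the result recorded in the item's files field. I have not re-run this against the site.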