I am using Scrapy to crawl a website. My start_url is a search results page with many pages. When I use LinkExtractor, it adds extra content to the URLs I want, so only the start_url itself is crawled and every other, polluted URL gets a 404:
2015-12-15 20:38:43 [scrapy] INFO: Spider opened
2015-12-15 20:38:43 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-15 20:38:43 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-15 20:38:44 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: None)
2015-12-15 20:38:50 [scrapy] DEBUG: Crawled (404) <GET http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93++++++++++++++++> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-15 20:38:50 [scrapy] DEBUG: Ignoring response <404 http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20http://task.zhubajie.com/success/p2.htmlkw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93++++++++++++++++>: HTTP status code is not handled or not allowed
...
2015-12-15 20:39:18 [scrapy] INFO: Closing spider (finished)
2015-12-15 20:39:18 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2578,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'downloader/response_bytes': 57627,
'downloader/response_count': 6,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 5,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 12, 15, 12, 39, 18, 70000),
'log_count/DEBUG': 12,
'log_count/INFO': 7,
'log_count/WARNING': 2,
'request_depth_max': 1,
'response_received_count': 6,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'start_time': datetime.datetime(2015, 12, 15, 12, 38, 43, 693000)}
2015-12-15 20:39:18 [scrapy] INFO: Spider closed (finished)
I want

http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93

instead of:

http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93++++++++++++++++

I don't know what is causing this. Can someone help me?
start_urls = [
    'http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93',
]
rules = [
    # Rule(LinkExtractor(allow=(r'task.zhubajie.com/success/p\d+\.html',)), callback='parse_item', follow=True),
    Rule(LinkExtractor(restrict_xpaths=('//div[@class="pagination"]',)), callback='parse_item', follow=True),
]
Edit: I tried using process_value like this:
Rule(LinkExtractor(restrict_xpaths=('//div[@class="pagination"]'), process_value=lambda x: x.strip()), callback='parse_item', follow=True)
And this:
def process_0(value):
    m = re.search('http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20', value)
    if m:
        return m.strip('http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20')
Neither of them worked: both produce the same log and request the wrong URLs.
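For reference, the second attempt also cannot work at the Python level, independent of the URLs involved: re.search returns a match object, which has no .strip() method, and even on a string, .strip(chars) removes any of the given characters from both ends rather than removing a prefix. A minimal illustration, using a made-up URL:

import re

m = re.search('abc', 'xxabcxx')
# m.strip(...) would raise AttributeError: match objects have no .strip()

# str.strip(chars) treats its argument as a set of characters, not a prefix:
print('http://example.com/p2.html'.strip('http://example.com/'))
# prints '2' -- every character of the URL except '2' is in the chars set

Note also that process_0 implicitly returns None for any value that does not contain the junk prefix, and LinkExtractor drops links whose process_value result is None, so clean URLs would be discarded as well.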
Answer 0 (score: 0)
All the links in the pagination block contain a lot of whitespace (screenshot: http://screencloud.net/v/qQLW). You should be able to pre-process the scraped values before the links are used, with code like the following:
# coding: utf-8
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def process_value(v):
    # each href holds whitespace plus an absolute URL; keep only the
    # last whitespace-separated token when it looks like a URL
    v1 = v.split()[-1]
    if v1.startswith('http'):
        v = v1
    return v


class MySpider(CrawlSpider):
    name = 'spider'
    start_urls = [
        'http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93'
    ]
    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="pagination"]',),
                           process_value=process_value), follow=True)
    ]
Spider output:
2015-12-18 10:35:37 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-12-18 10:35:37 [scrapy] INFO: Optional features available: ssl, http11
2015-12-18 10:35:37 [scrapy] INFO: Overridden settings: {}
2015-12-18 10:35:37 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-18 10:35:37 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-18 10:35:37 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-18 10:35:37 [scrapy] INFO: Enabled item pipelines:
2015-12-18 10:35:37 [scrapy] INFO: Spider opened
2015-12-18 10:35:37 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-18 10:35:37 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-18 10:35:38 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: None)
2015-12-18 10:35:40 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p4.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:40 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p6.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:40 [scrapy] DEBUG: Filtered duplicate request: <GET http://task.zhubajie.com/success/p3.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2015-12-18 10:35:41 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p3.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:41 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:47 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p5.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:54 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:54 [scrapy] INFO: Closing spider (finished)
2015-12-18 10:35:54 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2380,
'downloader/request_count': 7,
'downloader/request_method_count/GET': 7,
'downloader/response_bytes': 196525,
'downloader/response_count': 7,
'downloader/response_status_count/200': 7,
'dupefilter/filtered': 36,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 12, 18, 7, 35, 54, 945271),
'log_count/DEBUG': 9,
'log_count/INFO': 7,
'request_depth_max': 2,
'response_received_count': 7,
'scheduler/dequeued': 7,
'scheduler/dequeued/memory': 7,
'scheduler/enqueued': 7,
'scheduler/enqueued/memory': 7,
'start_time': datetime.datetime(2015, 12, 18, 7, 35, 37, 907281)}
2015-12-18 10:35:54 [scrapy] INFO: Spider closed (finished)
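As a side note, LinkExtractor's process_value may also return None to drop a link altogether. A slightly stricter variant of the function above (a sketch, not tested against this site) would discard any pagination href that does not contain an absolute URL instead of passing it through:

def process_value(v):
    # keep only the last whitespace-separated token; drop the link
    # entirely (return None) when nothing URL-like is found
    parts = v.split()
    if parts and parts[-1].startswith('http'):
        return parts[-1]
    return None

This avoids ever requesting a malformed URL, at the cost of silently skipping hrefs that do not match the expected shape.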