Scrapy LinkExtractor does not extract the correct URLs

Time: 2015-12-16 01:06:20

Tags: python web-scraping scrapy web-crawler

I am using Scrapy to crawl a website. My start_url is a search results page with many pages of results. When I use LinkExtractor, it adds extra content to the URLs I want, so I can only crawl the start_url itself; every other, polluted URL gets a 404.

2015-12-15 20:38:43 [scrapy] INFO: Spider opened
2015-12-15 20:38:43 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-15 20:38:43 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-15 20:38:44 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: None)
2015-12-15 20:38:50 [scrapy] DEBUG: Crawled (404) <GET http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93++++++++++++++++> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-15 20:38:50 [scrapy] DEBUG: Ignoring response <404 http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93++++++++++++++++>: HTTP status code is not handled or not allowed
...
2015-12-15 20:39:18 [scrapy] INFO: Closing spider (finished)
2015-12-15 20:39:18 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2578,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'downloader/response_bytes': 57627,
 'downloader/response_count': 6,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 5,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 12, 15, 12, 39, 18, 70000),
 'log_count/DEBUG': 12,
 'log_count/INFO': 7,
 'log_count/WARNING': 2,
 'request_depth_max': 1,
 'response_received_count': 6,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'start_time': datetime.datetime(2015, 12, 15, 12, 38, 43, 693000)}
2015-12-15 20:39:18 [scrapy] INFO: Spider closed (finished)

I want:

http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93 

not:

http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93++++++++++++++++

I don't know what is causing this. Can someone help me?

start_urls = [
    'http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93',
]

rules = [
    # Rule(LinkExtractor(allow=(r'task.zhubajie.com/success/p\d+\.html',)),
    #      callback='parse_item', follow=True),
    Rule(LinkExtractor(restrict_xpaths=('//div[@class="pagination"]')),
         callback='parse_item', follow=True)
]

Edit: I tried using process_value like this:

Rule(LinkExtractor(restrict_xpaths=('//div[@class="pagination"]'),
                   # str.strip() only removes whitespace at the two ends
                   process_value=lambda x: x.strip()),
     callback='parse_item', follow=True)

and this:

def process_0(value):
    # Never matches: process_value sees literal spaces, which are only
    # percent-encoded to %20 later, when the request is built.
    m = re.search('http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20', value)
    if m:
        # m is a Match object with no .strip(); str.strip() also takes a
        # set of characters to trim, not a prefix to remove.
        return m.strip('http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20')

Neither of them works. Both produce the same log and request the same wrong URLs.

1 Answer:

Answer 0 (score: 0)

All of the links in the pagination block contain a lot of whitespace (screenshot: http://screencloud.net/v/qQLW), so the scraped value needs to be preprocessed before it is used.
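
To see why the earlier attempts fail, it helps to look at what process_value actually receives. This is a minimal sketch, assuming Scrapy has already joined the padded href against the page URL by the time process_value is called (which is what the polluted URLs in the log suggest); the whitespace is still literal at that point:

import re

# What process_value receives, reconstructed from the 404s in the log
value = ('http://task.zhubajie.com/success/                    '
         'http://task.zhubajie.com/success/p2.html'
         '?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93                ')

# strip() only trims the two ends, so the bad prefix survives:
print(value.strip())
# http://task.zhubajie.com/success/                    http://task.zhubajie.com/success/p2.html?kw=...

# A regex looking for '%20' finds nothing, because the spaces are only
# percent-encoded later, when Scrapy builds the request:
print(re.search('%20', value))
# None

Keeping only the last whitespace-separated token sidesteps both problems. The spider below does that with a process_value callback: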

# coding: utf-8
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def process_value(v):
    # Keep only the last whitespace-separated token: for the polluted
    # values that is the real link; clean values pass through unchanged.
    v1 = v.split()[-1]
    if v1.startswith('http'):
        v = v1
    return v


class MySpider(CrawlSpider):
    name = 'spider'
    start_urls = [
        'http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93'
    ]
    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="pagination"]'),
                           process_value=process_value), follow=True)
    ]
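
As a quick check, calling the helper on the polluted value from the log (a hypothetical standalone call, outside the spider) returns the clean page URL:

polluted = ('http://task.zhubajie.com/success/                    '
            'http://task.zhubajie.com/success/p2.html'
            '?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93                ')
print(process_value(polluted))
# http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93

Note that this Rule drops the parse_item callback and only follows the pagination links; add the callback back if you also need to parse items from each page.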

Spider output:

2015-12-18 10:35:37 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-12-18 10:35:37 [scrapy] INFO: Optional features available: ssl, http11
2015-12-18 10:35:37 [scrapy] INFO: Overridden settings: {}
2015-12-18 10:35:37 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-18 10:35:37 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-18 10:35:37 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-18 10:35:37 [scrapy] INFO: Enabled item pipelines: 
2015-12-18 10:35:37 [scrapy] INFO: Spider opened
2015-12-18 10:35:37 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-18 10:35:37 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-18 10:35:38 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: None)
2015-12-18 10:35:40 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p4.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:40 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p6.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:40 [scrapy] DEBUG: Filtered duplicate request: <GET http://task.zhubajie.com/success/p3.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2015-12-18 10:35:41 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p3.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:41 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:47 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p5.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:54 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-18 10:35:54 [scrapy] INFO: Closing spider (finished)
2015-12-18 10:35:54 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2380,
 'downloader/request_count': 7,
 'downloader/request_method_count/GET': 7,
 'downloader/response_bytes': 196525,
 'downloader/response_count': 7,
 'downloader/response_status_count/200': 7,
 'dupefilter/filtered': 36,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 12, 18, 7, 35, 54, 945271),
 'log_count/DEBUG': 9,
 'log_count/INFO': 7,
 'request_depth_max': 2,
 'response_received_count': 7,
 'scheduler/dequeued': 7,
 'scheduler/dequeued/memory': 7,
 'scheduler/enqueued': 7,
 'scheduler/enqueued/memory': 7,
 'start_time': datetime.datetime(2015, 12, 18, 7, 35, 37, 907281)}
2015-12-18 10:35:54 [scrapy] INFO: Spider closed (finished)

LinkExtractor docs