Getting the Scrapy URL regular expression right

Date: 2014-07-09 14:07:07

Tags: python scrapy

I have written a spider in Scrapy which basically works and does exactly what it is supposed to do. But the problem shows up in the log when I run scrapy crawl. Here is my spider:

    # -*- coding: utf-8 -*-
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from ecommerce.items import ArticleItem


    class WikiSpider(CrawlSpider):
        name = 'wiki'
        start_urls = (
            'http://www.wiki.tn/index.php',
        )
        rules = [
            Rule(SgmlLinkExtractor(allow=[r'\w+\/\d{1,4}\/\d{1,4}\/\d{1,4}\X+']),
                 follow=True, callback='parse_Article_wiki'),
        ]

        def parse_Article_wiki(self, response):
            hxs = HtmlXPathSelector(response)
            item = ArticleItem()

            print '*******************>> ' + response.url

But when I run the spider, this is what it shows me:

    2014-07-09 15:03:13+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-07-09 15:03:13+0100 [scrapy] INFO: Enabled item pipelines:
    2014-07-09 15:03:13+0100 [wiki] INFO: Spider opened
    2014-07-09 15:03:13+0100 [wiki] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2014-07-09 15:03:13+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2014-07-09 15:03:13+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
    2014-07-09 15:03:13+0100 [wiki] DEBUG: Crawled (200) <GET http://www.wiki.tn/index.php> (referer: None)
    2014-07-09 15:03:13+0100 [wiki] INFO: Closing spider (finished)
    2014-07-09 15:03:13+0100 [wiki] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 219,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 13062,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 7, 9, 14, 3, 13, 416073),
         'log_count/DEBUG': 3,
         'log_count/INFO': 7,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2014, 7, 9, 14, 3, 13, 210430)}
    2014-07-09 15:03:13+0100 [wiki] INFO: Spider closed (finished)

1 Answer:

Answer 0 (score: 0)

I'm not sure what your problem is. Guessing, I think you want the site to be crawled, and that isn't happening.

If that is the problem, it could be the regular expression you are using in the rule definition. What kind of links do you want to follow?
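One thing worth checking: `\X` is not a recognized escape in Python regular expressions, so the trailing `\X+` in your `allow` pattern never matches a URL slug. Depending on the Python version it is either rejected as a bad escape or treated as a literal `X`, and either way the extractor finds no links. Here is a small sketch you can run to see this, using a hypothetical article URL (I'm assuming a word/number/number/number/slug layout, since I don't know the real site structure):

```python
import re

# Pattern copied from the question's Rule. "\X" is not a valid regex
# escape: modern Python rejects it, older versions matched a literal "X".
original = r'\w+\/\d{1,4}\/\d{1,4}\/\d{1,4}\X+'

# A hypothetical article URL (assumed structure, adjust to the real site).
sample = 'http://www.wiki.tn/article/2014/07/09/some-title'

try:
    matched = re.search(original, sample) is not None
except re.error:
    matched = False  # Python 3.6+ raises "bad escape \X"

assert not matched  # the original pattern extracts nothing

# Replacing the bogus \X+ with ".+" (anything after the third number)
# makes the pattern match such URLs:
fixed = r'\w+/\d{1,4}/\d{1,4}/\d{1,4}/.+'
assert re.search(fixed, sample) is not None
```

Testing the `allow` pattern with plain `re.search` against a few real URLs from the site is a quick way to debug a `LinkExtractor` rule before re-running the crawl.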

On the other hand, I would also suggest you use the allowed_domains variable, which in your case would be:

    allowed_domains = ['wiki.tn']
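To see why this helps: when `allowed_domains` is set, Scrapy's OffsiteMiddleware drops requests whose host is not that domain or one of its subdomains, keeping the crawl on-site. A rough sketch of that check (not Scrapy's actual source, just an illustration of the behavior):

```python
from urllib.parse import urlparse

def url_is_allowed(url, allowed_domains):
    """Approximation of the offsite check: keep a URL only if its host
    is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)

allowed = ['wiki.tn']
assert url_is_allowed('http://www.wiki.tn/index.php', allowed)       # kept
assert not url_is_allowed('http://facebook.com/share', allowed)      # dropped
```

So with `allowed_domains = ['wiki.tn']`, outbound links to other sites are filtered out automatically instead of being queued for crawling.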