Scrapy spider gets stuck in the middle of a crawl

Time: 2014-07-21 05:06:29

Tags: scrapy scrapy-spider

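I'm new to Scrapy, and I'm trying to build a spider that crawls a website and collects all the phone numbers, email addresses, PDF links, etc. from it. I want it to follow every link reachable from the main page so that it searches the entire domain.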

There is a similar question about this problem, but it has not been resolved: Why scrapy crawler stops?

Here is my spider's code:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mobilesuites.items import MobilesuitesItem
import re

class ExampleSpider(CrawlSpider):
    name = "hyatt"
    allowed_domains = ["hyatt.com"]
    start_urls = ( 
        'http://www.hyatt.com/',
    )   

    #follow every link except those whose URL contains ".jsp"
    rules = (
        Rule(SgmlLinkExtractor(deny=(r'.*\.jsp.*',)), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        #self.log('The current url is %s' % response.url)

        selector = Selector(response)
        item = MobilesuitesItem()
        #get url
        item['url'] = response.url

        #get page title
        titles = selector.select("//title")
        for t in titles:
            item['title'] = t.select("./text()").extract()

        #get all phone numbers, emails, and pdf links
        text = response.body
        item['phone'] = '|'.join(re.findall(r'\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(?\d{3}\)?[-\.\s]?\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4}', text))
        item['email'] = '|'.join(re.findall(r'[^\s@]+@[^\s@]+\.[^\s@]+', text))
        item['pdfs'] = '|'.join(re.findall(r'[^\s"<]*\.pdf[^\s">]*', text))

        #check to see if dining is mentioned on the page
        item['dining'] = bool(re.findall(r'\s[dD]ining\s|\s[mM]enu\s|\s[bB]everage\s', text))
        return item
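
To rule out the regular expressions as the cause, the extraction step in parse_item can be run on its own, outside Scrapy. This is only a minimal sketch: the helper name extract_contacts and the sample string are hypothetical, and it assumes the page body has already been decoded to text.

```python
import re

# Same patterns as in parse_item, written as raw strings.
PHONE_RE = re.compile(
    r'\d{3}[-\.\s]\d{3}[-\.\s]\d{4}'
    r'|\(?\d{3}\)?[-\.\s]?\d{3}[-\.\s]\d{4}'
    r'|\d{3}[-\.\s]\d{4}'
)
EMAIL_RE = re.compile(r'[^\s@]+@[^\s@]+\.[^\s@]+')
PDF_RE = re.compile(r'[^\s"<]*\.pdf[^\s">]*')
DINING_RE = re.compile(r'\s[dD]ining\s|\s[mM]enu\s|\s[bB]everage\s')


def extract_contacts(text):
    """Hypothetical helper: apply the spider's patterns to decoded page text."""
    return {
        'phone': '|'.join(PHONE_RE.findall(text)),
        'email': '|'.join(EMAIL_RE.findall(text)),
        'pdfs': '|'.join(PDF_RE.findall(text)),
        # search() is truthy exactly when findall() would be non-empty
        'dining': bool(DINING_RE.search(text)),
    }


# Made-up sample text, not taken from the crawled site.
sample = ('Call 312-555-0100 or email info@example.com '
          'for the menu at http://example.com/menu.pdf today')
result = extract_contacts(sample)
print(result)
```

If this runs quickly on a saved copy of the last page in the log, the hang is more likely in link following or downloading than in the parsing code.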

Here is the last part of the crawl log before the crawl hangs:

2014-07-21 18:18:57-0500 [hyatt] DEBUG: Scraped from <200 http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html>
    {'email': '',
     'phone': '',
     'title': [u'24/7 Gallery Menu'],
     'url': 'http://www.place.hyatt.com/en/hyattplace/eat-and-drink/24-7-gallery-menu.html'}
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Ignoring response <404 http://hyatt.com/gallery/thrive/siteMap.html>: HTTP status code is not handled or not allowed
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.hyatt.com/hyatt/pure/contact/> (referer: http://www.hyatt.com/hyatt/pure/?icamp=HY_HyattPure_HPLS)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/aboutus.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.place.hyatt.com/en/hyattplace/eat-and-drink/eat-and-drink.html> (referer: http://www.place.hyatt.com/en/hyattplace/eat-and-drink.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.park.hyatt.com/en/parkhyatt/newsandannouncements.html?icamp=park_hppsa_new_hotels> (referer: http://www.park.hyatt.com/en/parkhyatt.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.regency.hyatt.com/en/hyattregency/meetingsandevents.html> (referer: http://www.regency.hyatt.com/en/hyattregency.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/specialoffers.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)
2014-07-21 18:18:57-0500 [hyatt] DEBUG: Crawled (200) <GET http://www.house.hyatt.com/en/hyatthouse/locations.html> (referer: http://www.house.hyatt.com/en/hyatthouse.html)

0 answers:

No answers