Crawler returns empty results

Asked: 2014-05-16 22:22:37

Tags: python python-2.7 scrapy

I have built a scraper for this page (with help from Stack Overflow), but the results come back empty. A single-page spider works and scrapes all the required items, but the crawler that follows the next pages does not. I don't understand what's wrong here.

Here is the crawler:

from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


from mymobile.items import MymobileItem


class MmobySpider(CrawlSpider):
    name = "mmoby2" 
    allowed_domains = "http://mymobile.ge"
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69"
    ]

    rule = (Rule(SgmlLinkExtractor(allow=("new/v2.php?cat=69&pnum=\d*", ))
        , callback="parse_items", follow=True),)

    def parse_items(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
        items = []
        for t in titles:
            url = sel.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new", url[0])

            items.append(item)

        return(items)   

1 answer:

Answer 0 (score: 1):

Two main problems:

  • The attribute is named rules, not rule:

    rules = (Rule(SgmlLinkExtractor(allow=("new/v2.php?cat=69&pnum=\d*", )), 
                  callback="parse_items", 
                  follow=True), )
    
  • allowed_domains should be a list of domain names (without the scheme):

    allowed_domains = ["mymobile.ge"]
    

Also, you need to adjust the regular expression, as Paul suggested in the comments.
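The likely reason the regex needs adjusting (an assumption, since Paul's comment is not quoted here): the allow patterns of SgmlLinkExtractor are regular expressions, so in new/v2.php?cat=69&pnum=\d* the unescaped . and ? are metacharacters. In particular, ? makes the preceding p optional instead of matching the literal ? in the URL. A minimal sketch with Python's re module (standalone, outside Scrapy) illustrating the difference:

```python
import re

# A typical pagination URL on the target site.
url = "http://mymobile.ge/new/v2.php?cat=69&pnum=2"

# Unescaped pattern from the question: '?' turns 'p' into an optional
# character and '.' matches any character, so the literal '?' in the
# URL is never accounted for and the search finds no match.
unescaped = r"new/v2.php?cat=69&pnum=\d*"

# Escaped pattern: '\.' and '\?' match the literal '.' and '?'.
escaped = r"new/v2\.php\?cat=69&pnum=\d+"

print(re.search(unescaped, url))            # no match
print(re.search(escaped, url).group(0))     # matches the page link
```

The same escaped string can then be passed to SgmlLinkExtractor's allow argument in the rules tuple.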