Scrapy CrawlSpider not following links

Asked: 2015-06-09 03:23:55

Tags: python web-scraping web-crawler scrapy scrapy-spider

I am trying to scrape certain attributes from all (#123) detail pages available on this category page - http://stinkybklyn.com/shop/cheese/ - but Scrapy is not following the link pattern I set. I checked the Scrapy documentation and a few tutorials as well, but no luck!

Here is the code:

import scrapy

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/chandoka",
    ]
    Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'),
         callback='parse_items', follow=True)


    def parse_items(self, response):
        print "response", response
        hxs= HtmlXPathSelector(response)
        title=hxs.select("//*[@id='content']/div/h4").extract()
        title="".join(title)
        title=title.strip().replace("\n","").lstrip()
        print "title is:",title

Can someone advise what I am doing wrong here?

2 Answers:

Answer 0 (score: 1):

The main problem with your code is that you haven't set rules for the CrawlSpider.
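To make the failure concrete, here is a minimal sketch (hypothetical spider names, not from the question) of the difference: a bare Rule(...) expression in the class body is created and immediately discarded, while CrawlSpider only reads rules from the rules class attribute:

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class BrokenSpider(CrawlSpider):
    name = "broken"

    # This Rule object is created and then thrown away: CrawlSpider never
    # sees it, so no links are followed.
    Rule(LinkExtractor(allow=r'/shop/cheese/.*'),
         callback='parse_items', follow=True)


class FixedSpider(CrawlSpider):
    name = "fixed"

    # CrawlSpider reads link-following rules from the rules class attribute;
    # the trailing comma keeps the one-element tuple a tuple.
    rules = (
        Rule(LinkExtractor(allow=r'/shop/cheese/.*'),
             callback='parse_items', follow=True),
    )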

Other improvements I would suggest:

  • No need to instantiate HtmlXPathSelector - you can use response directly
  • select() is now deprecated - use xpath() instead
  • Get the text() of the title element in order to retrieve, for example, Chandoka instead of <h4>Chandoka</h4>
  • I think you meant to start from the cheese shop catalog page: http://stinkybklyn.com/shop/cheese

The complete code with the improvements applied:

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]

    start_urls = [
        "http://stinkybklyn.com/shop/cheese",
    ]

    rules = [
        # Follow every link matching /shop/cheese/ and parse each page
        # it leads to with parse_items
        Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'), callback='parse_items', follow=True)
    ]

    def parse_items(self, response):
        title = response.xpath("//*[@id='content']/div/h4/text()").extract()
        title = "".join(title)
        title = title.strip().replace("\n", "").lstrip()
        print "title is:", title

Answer 1 (score: 0):

It looks like you have some syntax errors: the Rule has to be assigned to the rules attribute, and the body of parse_items has to be indented. Try this:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector


class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/",
    ]

    rules = (
        Rule(LinkExtractor(allow=r'/shop/cheese/'), callback='parse_items'),
    )

    def parse_items(self, response):
        print "response", response