SgmlLinkExtractor not displaying results or following links

Date: 2015-03-12 18:46:40

Tags: python web-crawler scrapy scrapy-spider sgml

I am having trouble fully understanding how the SGML Link Extractor works. When building a crawler with Scrapy, I can successfully extract data from links when using specific URLs. The problem is using a Rule to follow the next-page link on a given URL.

I believe the problem lies with the allow() attribute. When the Rule is added to the code, results are not displayed on the command line and the link to the next page is not followed.

Any help is much appreciated.

Here is the code...

import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.spiders import Rule

from tutorial.items import TutorialItem

class AllGigsSpider(CrawlSpider):
    name = "allGigs"
    allowed_domains = ["http://www.allgigs.co.uk/"]
    start_urls = [
        "http://www.allgigs.co.uk/whats_on/London/clubbing-1.html",
        "http://www.allgigs.co.uk/whats_on/London/festivals-1.html",
        "http://www.allgigs.co.uk/whats_on/London/comedy-1.html",
        "http://www.allgigs.co.uk/whats_on/London/theatre_and_opera-1.html",
        "http://www.allgigs.co.uk/whats_on/London/dance_and_ballet-1.html"
    ]    
    rules = (Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//div[@class="more"]',)), callback="parse_me", follow= True),
    )

    def parse_me(self, response):
        hxs = HtmlXPathSelector(response)
        infos = hxs.xpath('//div[@class="entry vevent"]')
        items = []
        for info in infos:
            item = TutorialItem()
            item ['artist'] = hxs.xpath('//span[@class="summary"]//text()').extract()
            item ['date'] = hxs.xpath('//abbr[@class="dtstart dtend"]//text()').extract()
            item ['endDate'] = hxs.xpath('//abbr[@class="dtend"]//text()').extract()            
            item ['startDate'] = hxs.xpath('//abbr[@class="dtstart"]//text()').extract()
            items.append(item)
        return items
        print items

1 Answer:

Answer 0 (score: 0):

The problem is in restrict_xpaths: it should point to the block where the link extractor should look for links. Don't specify allow at all:

rules = [
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="more"]'), 
         callback="parse_me", 
         follow=True),
]
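
As a side note, SgmlLinkExtractor and the whole scrapy.contrib package were deprecated in later Scrapy releases. A minimal sketch of the same rule written against the current API (assuming Scrapy 1.0+) would be:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

rules = [
    Rule(LinkExtractor(restrict_xpaths='//div[@class="more"]'),
         callback="parse_me",
         follow=True),
]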

You also need to fix allowed_domains; it should contain domain names, not URLs:

allowed_domains = ["www.allgigs.co.uk"]
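
This matters because Scrapy's OffsiteMiddleware compares the hostname of every request the spider produces against allowed_domains; with a full URL in that list the hostname never matches, so the links extracted by the rule are silently dropped as offsite requests, which is consistent with nothing showing up on the command line.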

Also note that the print items statement in the parse_me() callback is unreachable, since it comes after the return statement. And inside the loop you should not apply the XPath expressions with hxs; they should be applied in the context of each info selector, otherwise every item is filled with the data for the entire page. You can also simplify parse_me():

def parse_me(self, response):
    for info in response.xpath('//div[@class="entry vevent"]'):
        item = TutorialItem()
        item['artist'] = info.xpath('.//span[@class="summary"]//text()').extract()
        item['date'] = info.xpath('.//abbr[@class="dtstart dtend"]//text()').extract()
        item['endDate'] = info.xpath('.//abbr[@class="dtend"]//text()').extract()            
        item['startDate'] = info.xpath('.//abbr[@class="dtstart"]//text()').extract()
        yield item
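
For completeness, the TutorialItem imported from tutorial.items is not shown in the question; a minimal definition matching the fields used above (the field names come from the question, everything else is an assumption) could look like this:

# tutorial/items.py - assumed minimal definition, field names taken from the question
import scrapy

class TutorialItem(scrapy.Item):
    artist = scrapy.Field()
    date = scrapy.Field()
    endDate = scrapy.Field()
    startDate = scrapy.Field()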