I'm having trouble fully understanding how the SGML Link Extractor works. When building a crawler with Scrapy, I can successfully extract data from links when I use specific URLs. The problem is getting the rules to follow the next-page links on those URLs.

I suspect the problem lies with the allow() attribute. When the rule is added to the code, no results are shown on the command line and the links to the next pages are not followed.

Any help is much appreciated.

Here's the code...
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.spiders import Rule
from tutorial.items import TutorialItem

class AllGigsSpider(CrawlSpider):
    name = "allGigs"
    allowed_domains = ["http://www.allgigs.co.uk/"]
    start_urls = [
        "http://www.allgigs.co.uk/whats_on/London/clubbing-1.html",
        "http://www.allgigs.co.uk/whats_on/London/festivals-1.html",
        "http://www.allgigs.co.uk/whats_on/London/comedy-1.html",
        "http://www.allgigs.co.uk/whats_on/London/theatre_and_opera-1.html",
        "http://www.allgigs.co.uk/whats_on/London/dance_and_ballet-1.html"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//div[@class="more"]',)),
             callback="parse_me", follow=True),
    )

    def parse_me(self, response):
        hxs = HtmlXPathSelector(response)
        infos = hxs.xpath('//div[@class="entry vevent"]')
        items = []
        for info in infos:
            item = TutorialItem()
            item['artist'] = hxs.xpath('//span[@class="summary"]//text()').extract()
            item['date'] = hxs.xpath('//abbr[@class="dtstart dtend"]//text()').extract()
            item['endDate'] = hxs.xpath('//abbr[@class="dtend"]//text()').extract()
            item['startDate'] = hxs.xpath('//abbr[@class="dtstart"]//text()').extract()
            items.append(item)
        return items
        print items
Answer 0 (score: 0)
The problem is in restrict_xpaths: it should point to the block where the link extractor is supposed to look for links. Don't specify allow:
rules = [
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@class="more"]'),
         callback="parse_me",
         follow=True),
]
You also need to fix allowed_domains:
allowed_domains = ["www.allgigs.co.uk"]
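As a side note (a standard-library sketch, not part of the original answer): Scrapy checks allowed_domains entries against the hostname of each request URL, which is why an entry containing a scheme and trailing slash never matches anything. A quick illustration of what the hostname actually looks like:

```python
from urllib.parse import urlparse

# allowed_domains entries are compared against the hostname of each
# request URL, so they must be bare domains, not full URLs: an entry
# like "http://www.allgigs.co.uk/" can never equal a hostname.
url = "http://www.allgigs.co.uk/whats_on/London/clubbing-1.html"
print(urlparse(url).netloc)  # www.allgigs.co.uk
```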
Also note that the print items at the end of the parse_me() callback is unreachable, since it comes after the return statement. And, inside the loop, you should not apply the XPath expressions via hxs; the expressions should be applied in the context of each info. You can simplify parse_me():
def parse_me(self, response):
    for info in response.xpath('//div[@class="entry vevent"]'):
        item = TutorialItem()
        item['artist'] = info.xpath('.//span[@class="summary"]//text()').extract()
        item['date'] = info.xpath('.//abbr[@class="dtstart dtend"]//text()').extract()
        item['endDate'] = info.xpath('.//abbr[@class="dtend"]//text()').extract()
        item['startDate'] = info.xpath('.//abbr[@class="dtstart"]//text()').extract()
        yield item
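To see why the search context matters, here is a minimal sketch using the standard library's xml.etree instead of Scrapy's selectors, on made-up sample markup: a query rooted at the document matches every result on every loop iteration, while a query rooted at the current node matches only that node's contents.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup standing in for the crawled page.
doc = ET.fromstring(
    "<body>"
    "<div class='entry'><span class='summary'>Act A</span></div>"
    "<div class='entry'><span class='summary'>Act B</span></div>"
    "</body>"
)
for entry in doc.findall(".//div[@class='entry']"):
    # Rooted at the document: matches ALL summaries on every iteration,
    # which is the bug in the original loop over hxs.
    from_doc = doc.findall(".//span[@class='summary']")
    # Rooted at the current entry: matches only this entry's summary.
    from_entry = entry.findall(".//span[@class='summary']")
    print(len(from_doc), len(from_entry))  # 2 1
```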