I am trying to get a scrapy spider working, but there seems to be a problem with SgmlLinkExtractor.
This is the signature:
SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)
I am using the allow option; this is my code:
start_urls = ['http://bigbangtrans.wordpress.com']
rules = [Rule(SgmlLinkExtractor(allow=[r'series-\d{1}-episode-\d{2}.']), callback='parse_item')]
A sample URL looks like http://bigbangtrans.wordpress.com/series-1-episode-11-the-pancake-batter-anomaly/
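As a sanity check, the allow pattern can be tested against the sample URL directly. This is a minimal sketch; it assumes (as Scrapy's link extractors do) that the pattern is applied with re.search, so a partial match anywhere in the URL is enough:

```python
import re

# the allow pattern from the Rule and the sample post URL from the question
pattern = r'series-\d{1}-episode-\d{2}.'
url = 'http://bigbangtrans.wordpress.com/series-1-episode-11-the-pancake-batter-anomaly/'

# re.search matches anywhere in the string, so the URL prefix does not matter
match = re.search(pattern, url)
print(match.group(0))  # series-1-episode-11-
```

So the pattern itself does match the sample URL, which suggests the problem lies elsewhere.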
The output of scrapy crawl tbbt contains:
[tbbt] DEBUG: Crawled (200) <http://bigbangtrans.wordpress.com/series-3-episode-17-the-precious-fragmentation/> (referer: http://bigbangtrans.wordpress.com)
However, the parse_item callback is never called, and I cannot figure out why.
Here is the full spider code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class TbbtSpider(CrawlSpider):
    #print '\n TbbtSpider \n'
    name = 'tbbt'
    start_urls = ['http://bigbangtrans.wordpress.com']  # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(allow=[r'series-\d{1}-episode-\d{2}.']), callback='parse_item')]

    def parse_item(self, response):
        print '\n parse_blogpost \n'
        hxs = HtmlXPathSelector(response)
        item = TbbtItem()  # TbbtItem is assumed to be imported from the project's items module
        # Extract title
        item['title'] = hxs.select('//div[@id="post-5"]/div/p/span/text()').extract()  # XPath selector for title
        return item
Answer 0 (score: 2)
OK, so the reason this code does not work is that the syntax of the rules is incorrect. I fixed the syntax without making any other changes, and I was able to hit the parse_item callback:
rules = (
    Rule(SgmlLinkExtractor(allow=(r'series-\d{1}-episode-\d{2}.',)),
         callback='parse_item'),
)
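One detail worth noting in the corrected version is the trailing comma inside allow=(...). This is plain Python, not Scrapy-specific: parentheses alone do not make a tuple, so without the comma the extractor would receive a bare string instead of a one-element tuple. A minimal illustration:

```python
# Parentheses without a trailing comma do not create a tuple;
# allow=('pattern') would pass a plain string to the extractor.
just_a_string = (r'series-\d{1}-episode-\d{2}.')
real_tuple = (r'series-\d{1}-episode-\d{2}.',)

print(type(just_a_string).__name__)  # str
print(type(real_tuple).__name__)     # tuple
```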
However, the titles are all blank, which suggests that the hxs.select statement in parse_item is incorrect. The XPath below may be more appropriate (I made an educated guess at the desired title, but I may be barking up the wrong tree entirely):
item['title'] = hxs.select('//h2[@class="title"]/text()').extract()
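The guessed XPath can be exercised outside Scrapy against a small HTML fragment. This sketch uses the standard library's ElementTree rather than Scrapy's selector, and the markup below is an assumption about the page structure, not the real page:

```python
import xml.etree.ElementTree as ET

# A guessed fragment of the blog post page; the real WordPress markup may differ.
html = (
    '<div class="post">'
    '<h2 class="title">Series 1 Episode 11 - The Pancake Batter Anomaly</h2>'
    '<div class="entry"><p>Scene: the apartment.</p></div>'
    '</div>'
)

root = ET.fromstring(html)
# Rough equivalent of the XPath //h2[@class="title"]/text()
title = root.find('.//h2[@class="title"]').text
print(title)  # Series 1 Episode 11 - The Pancake Batter Anomaly
```

If the real page uses a different class name or heading level, the same one-off check makes that easy to spot before rerunning the crawl.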