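I'm currently trying to use Scrapy to crawl the Elite Dangerous subreddit and collect post titles, URLs, and vote counts. I got the first two working fine, but I'm not sure how to write an XPath expression to access the votes.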
selector.xpath('//div[@class="score unvoted"]').extract()
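works, but it returns the vote counts of every post on the current page (rather than the vote count of each individual post). response.css('div.score.unvoted').extract()
works per post, but returns [u'<div class="score unvoted">1</div>']
rather than just 1. (I'd also really like to know how to do this with XPath! :)
The code is as follows: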
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# RedditItem is the project's own Item class; its import is not shown in the question.


class redditSpider(CrawlSpider):  # http://doc.scrapy.org/en/1.0/topics/spiders.html#scrapy.spiders.CrawlSpider
    name = "reddits"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "https://www.reddit.com/r/elitedangerous",
    ]
    rules = [
        Rule(LinkExtractor(
            allow=[r'/r/EliteDangerous/\?count=\d*&after=\w*']),  # Looks for next page with RE
            callback='parse_item',  # What do I do with this? --- pass to self.parse_item
            follow=True),  # Tells spider to continue after callback
    ]

    def parse_item(self, response):
        selector_list = response.css('div.thing')  # Each individual little "box" with content
        for selector in selector_list:
            item = RedditItem()
            item['title'] = selector.xpath('div/p/a/text()').extract()
            item['url'] = selector.xpath('a/@href').extract()
            # item['votes'] = selector.xpath('//div[@class="score unvoted"]')
            item['votes'] = selector.css('div.score.unvoted').extract()
            yield item
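Answer 0: (score: 2)
You're on the right track. The first approach just needs two things: a leading dot, so that the expression is evaluated relative to the current selector rather than the whole page, and text(), to select the text inside the element instead of the element itself.
Fixed version: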
selector.xpath('.//div[@class="score unvoted"]/text()').extract()
Also, FYI, you can use the ::text pseudo-element to make the second option work as well:
response.css('div.score.unvoted::text').extract()
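To see why the leading dot matters, here is a minimal sketch of the same scoping idea. It uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors, purely as an assumption so that it runs without Scrapy installed, and the markup is a made-up, simplified stand-in for the real page. ElementTree has no text() step, so the text is read from the .text attribute instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a listing page (not Reddit's real markup).
PAGE = """
<html><body>
  <div class="thing"><div class="score unvoted">1</div><p>first</p></div>
  <div class="thing"><div class="score unvoted">42</div><p>second</p></div>
</body></html>
"""

root = ET.fromstring(PAGE)
SCORE = './/div[@class="score unvoted"]'

# Searching from the document root matches every score on the page.
# This mirrors what an absolute //div[...] does, even when it is
# called on a per-post selector in Scrapy.
all_scores = [div.text for div in root.findall(SCORE)]

# Searching from each "thing" element only sees that post's own score.
# This mirrors what the leading dot in .//div[...] achieves in Scrapy.
per_post = [
    [div.text for div in thing.findall(SCORE)]
    for thing in root.findall('.//div[@class="thing"]')
]

print(all_scores)  # ['1', '42']
print(per_post)    # [['1'], ['42']]
```

In the spider itself, the relative form would go inside the for loop, so that each item gets only its own post's score.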
Answer 1: (score: 0)
This should work:
selector.xpath('//div[contains(@class, "score unvoted")]/text()').extract()