Not sure how to write an XPath for a specific website element

Time: 2015-10-13 23:18:36

Tags: python css python-2.7 xpath scrapy

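I'm currently trying to use Scrapy to crawl the Elite Dangerous subreddit and collect post titles, URLs, and vote counts. I got the first two working fine, but I'm not sure how to write an XPath expression to get the votes.

selector.xpath('//div[@class="score unvoted"]').extract() works, but it returns the vote counts of every post on the current page (rather than the count for each individual post). response.css('div.score.unvoted').extract() works per post, but returns [u'<div class="score unvoted">1</div>'] rather than just 1. (I'd also really like to know how to do this with XPath! :)

The code is below: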

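from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# RedditItem (fields: title, url, votes) is assumed to be defined elsewhere in the project and imported here.


class redditSpider(CrawlSpider):  # http://doc.scrapy.org/en/1.0/topics/spiders.html#scrapy.spiders.CrawlSpider
    name = "reddits"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "https://www.reddit.com/r/elitedangerous",
    ]

    rules = [
        Rule(LinkExtractor(
            allow=[r'/r/EliteDangerous/\?count=\d*&after=\w*']),  # Looks for next page with RE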
        callback='parse_item',  # What do I do with this? --- pass to self.parse_item
        follow=True),  # Tells spider to continue after callback
    ]

    def parse_item(self, response):
        selector_list = response.css('div.thing') # Each individual little "box" with content

        for selector in selector_list:
            item = RedditItem()
            item['title'] = selector.xpath('div/p/a/text()').extract()
            item['url'] = selector.xpath('a/@href').extract()
            # item['votes'] = selector.xpath('//div[@class="score unvoted"]')
            item['votes'] = selector.css('div.score.unvoted').extract()
            yield item

2 answers:

Answer 0: (score: 2)

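You are on the right track. The first approach only needs two things: a leading dot, so that the expression is evaluated relative to the current post instead of the whole page, and text() at the end, so that it selects the text node rather than the whole element.

Fixed version: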

selector.xpath('.//div[@class="score unvoted"]/text()').extract()

And, FYI, you can make the second option work as well by using the ::text pseudo-element:

response.css('div.score.unvoted::text').extract()

Answer 1: (score: 0)

This should work:

selector.xpath('//div[contains(@class, "score unvoted")]/text()').extract()