Web scraping Stack Overflow with Scrapy, but I can't get the vote count of each question

Time: 2018-11-07 19:43:08

Tags: python web-scraping scrapy scrapy-spider

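I am scraping Stack Overflow and have already extracted the title, URL, and tags, but I can't get the vote count for each question. Can anyone help me? I'm not very good with XPath.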

def parse_item(self, response):
    questions = response.xpath('//div[@class="question-summary"]')

    for question in questions:
        item = StackItem()
        item['url'] = question.xpath(
            'div[@class="summary"]/h3/a[@class="question-hyperlink"]/@href').extract()[0]
        item['title'] = question.xpath(
            'div[@class="summary"]/h3/a[@class="question-hyperlink"]/text()').extract()[0]
        item['tags'] = question.xpath(
            'div[@class="summary"]/div[2]/a[@class="post-tag"]/text()').extract()
        item['votes'] = question.xpath(
            '/div[1]/div[1]/div[1]/div[1]/span/strong/textContent()').extract()[0]

        yield item

The page I am scraping: https://stackoverflow.com/questions?page=2&sort=newest

2 answers:

Answer 0: (score: 1)

item['votes'] = question.css('.vote-count-post > strong::text').extract()[0]

Answer 1: (score: 1)

If you want to use XPath:

item['votes'] = question.xpath(".//div[@class='votes']//strong/text()").extract_first()

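Note the dot at the start of the .//div XPath: it makes the expression relative to the current question node, whereas a bare //div would search the whole document. See the Scrapy documentation on working with relative XPaths.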