Scrapy: correct XPath for an unnamed div containing an image and text

Time: 2016-05-12 06:23:01

Tags: python xpath web-scraping scrapy

I am building a spider that walks through several paginated pages and extracts data from this site: http://www.usnews.com/education/best-global-universities/neuroscience-behavior

Here is the spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import Rule  # scrapy.contrib.spiders is the deprecated pre-1.0 path
from scrapy.linkextractors import LinkExtractor
from lxml import html
from usnews.items import UsnewsItem


class UniversitiesSpider(scrapy.Spider):
    name = "universities"
    allowed_domains = ["usnews.com"]
    start_urls = (
        'http://www.usnews.com/education/best-global-universities/neuroscience-behavior/',
        )

    #Rules = [
    #Rule(LinkExtractor(allow=(), restrict_xpaths=('.//a[@class="pager_link"]',)), callback="parse", follow= True)
    #]

    def parse(self, response):
        for sel in response.xpath('.//div[@class="sep"]'):
            item = UsnewsItem()
            item['name'] = sel.xpath('.//h2[@class="h-taut"]/a/text()').extract()
            item['location'] = sel.xpath('.//span[@class="t-dim t-small"]/text()').extract()
            item['ranking'] = sel.xpath('.//div[3]/div[2]/text()').extract()
            item['score'] = sel.xpath('.//div[@class="t-large t-strong t-constricted"]/text()').extract()
            #print(sel.xpath('.//text()').extract())
            yield item

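I am having trouble extracting the text for the "ranking" item. According to Google Chrome's XPath suggestion, the XPath is //*[@id="resultsMain"]/div[1]/div[1]/div[3]/div[2], which gives me a single number for the first entry and a bunch of empty values. The ranking seems to sit in a div next to an img tag, and I am confused about how to reach it and extract only the text (e.g. #1, #22, etc.).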

1 answer:

Answer 0: (score: 1)

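The following XPath should find the div that has an img child, and then return that div's non-empty text node children, which contain the "ranking":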

for sel in response.xpath('.//div[@class="sep"]'):
    ...
    item['ranking'] = sel.xpath('div/div[img]/text()[normalize-space()]').extract()
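
To check the expression outside the spider, here is a minimal sketch that runs the same XPath with lxml against a made-up HTML fragment. The markup in SAMPLE, the university names, and the helper name extract_rankings are assumptions for illustration only; the real page structure may differ, so treat this as a way to test the idea rather than as the site's actual layout.

```python
from lxml import html

# Hypothetical markup (an assumption): each "sep" div wraps an inner div that
# holds an img followed by the ranking as a bare text node.
SAMPLE = """
<div id="resultsMain">
  <div class="sep">
    <div>
      <h2 class="h-taut"><a href="#">Example University A</a></h2>
      <div>
        <img src="badge.png" alt="">
        #1
      </div>
    </div>
  </div>
  <div class="sep">
    <div>
      <h2 class="h-taut"><a href="#">Example University B</a></h2>
      <div>
        <img src="badge.png" alt="">
        #22
      </div>
    </div>
  </div>
</div>
"""


def extract_rankings(markup):
    """Return one stripped ranking string (or None) per div.sep entry."""
    tree = html.fromstring(markup)
    rankings = []
    for sel in tree.xpath('.//div[@class="sep"]'):
        # div[img] keeps only the div that has an img child;
        # text()[normalize-space()] drops whitespace-only text nodes.
        texts = sel.xpath('div/div[img]/text()[normalize-space()]')
        rankings.append(texts[0].strip() if texts else None)
    return rankings


if __name__ == "__main__":
    print(extract_rankings(SAMPLE))
```

The predicate [normalize-space()] is what filters out the whitespace-only text nodes around the img. The surviving text node still carries its surrounding newlines and indentation, so the sketch strips it before storing; in the spider the same cleanup would have to be applied to the strings returned by .extract().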