XPath: the Nth TD inside a TR

Date: 2017-08-04 19:59:43

Tags: python xpath scrapy

I'm scraping the following page: http://graphics.stltoday.com/apps/payrolls/salaries/teachers/detail/25074/ and trying to pull every value out of the table (salary, position, years in district, etc.). When I run response.xpath('//th[@scope="row"]/following-sibling::td[1]/text()').extract() from the scrapy shell, all of the values show up. However, when I do the same thing inside the crawler, only the first element (district) comes through. Any suggestions?
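
For reference, this is roughly how the expression was being checked interactively (a minimal sketch; the listed values are illustrative, taken from the detail page above):

$ scrapy shell "http://graphics.stltoday.com/apps/payrolls/salaries/teachers/detail/25074/"
>>> response.xpath('//th[@scope="row"]/following-sibling::td[1]/text()').extract()
['Affton 101', 'Central Office', 'Central Office Admin.', ...]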

Scraping code (ideally, each element would end up in its own variable for cleaner output):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Spider2(CrawlSpider):
    #name of the spider
    name = 'stlteacher'

    #list of allowed domains
    allowed_domains = ['graphics.stltoday.com']

    #starting url for scraping
    start_urls = ['http://graphics.stltoday.com/apps/payrolls/salaries/teachers/']
    rules = [
    Rule(LinkExtractor(
        allow=['/apps/payrolls/salaries/teachers/[0-9]+/$']),
        follow=True),
    Rule(LinkExtractor(
        allow=['/apps/payrolls/salaries/teachers/[0-9]+/position/[0-9]+/$']),
        follow=True),
    Rule(LinkExtractor(
        allow=['/apps/payrolls/salaries/teachers/detail/[0-9]+/$']),
        callback='parse_item',
        follow=True),
    ]

    #setting the location of the output csv file
    custom_settings = {
        'FEED_FORMAT' : "csv",
        'FEED_URI' : 'tmp/stlteachers3.csv'
    }

    def parse_item(self, response):
        #Remove XML namespaces
        response.selector.remove_namespaces()

        #Extract article information
        url = response.url
        name = response.xpath('//p[@class="table__title"]/text()').extract()
        district = response.xpath('//th[@scope="row"]/following-sibling::td[1]/text()').extract()

        for item in zip(name, district):
            scraped_info = {
                'url' : url,
                'name' : item[0],
                'district' : item[1],

            }
            yield scraped_info

1 Answer:

Answer 0 (score: 3):

Your zip() is a bit confusing. If you want to crawl the whole table, you need to iterate over the table rows and look up each row's name and value.
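
The underlying problem is that zip() stops at the shortest iterable: name matches a single page title, so zipping it against the full list of td values produces exactly one pair. A minimal sketch of that behaviour, with illustrative values:

name = ['Bracht, Nathan']                                     # one <p class="table__title"> per page
district = ['Affton 101', 'Central Office', '$152,000.00']    # one value per table row
print(list(zip(name, district)))
# [('Bracht, Nathan', 'Affton 101')] -- only the first pair survives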

I got pretty decent results with this code:

def parse_item(self, response):
    # The page title holds the teacher's name
    name = response.xpath('//p[@class="table__title"]/text()').extract_first()
    item = {
        'name': name,
        'url': response.url
    }
    # Walk every row header and pair it with the first <td> that follows it
    for row in response.xpath('//th[@scope="row"]'):
        row_name = row.xpath('text()').extract_first('').lower().strip(':')
        row_value = row.xpath('following-sibling::td[1]/text()').extract_first()
        item[row_name] = row_value
    yield item

Which returns:

{
  'name': 'Bracht, Nathan',
  'url': 'http://graphics.stltoday.com/apps/payrolls/salaries/teachers/detail/25074/',
  'district': 'Affton 101',
  'school': 'Central Office',
  'position': 'Central Office Admin.',
  'degree earned': 'Doct',
  'salary': '$152,000.00',
  'extended contract pay': None,
  'extra duty pay': None,
  'total pay (all combined)': '$152,000.00',
  'years in district': '5',
  'years in mo schools': '19',
  'multiple position detail': None
}
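
A possible follow-up, not part of the original answer: because the row labels become item keys at runtime, the column order of the exported CSV can vary. If a fixed order is wanted, Scrapy's FEED_EXPORT_FIELDS setting can pin it down; a minimal sketch reusing the spider's existing custom_settings (the field list below is an assumption based on the keys shown above):

custom_settings = {
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'tmp/stlteachers3.csv',
    # Assumed column order, matching the keys in the sample item above
    'FEED_EXPORT_FIELDS': ['url', 'name', 'district', 'school', 'position',
                           'degree earned', 'salary', 'total pay (all combined)'],
}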