Question

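I have been using the scrapy library in Python 3 to build a web scraper, and I have run into a problem I don't understand. I have successfully used Inspect Element on other tables to get the XPath expressions needed to scrape them. With this table, however, I cannot figure out how to extract the data. I am new to HTML but not to programming, so if I am way off here, please help me out.

An example of the web page is: http://land.elpasoco.com/ResidentialBuilding.aspx?schd=5317443025&bldg=1

Inspecting the page and copying the XPath of the target table gives //*[@id="aspnetForm"]/table/tbody/tr[3]/td[1]/table/tbody/tr[1]/td/table/tbody/tr[3]/td/table

However, using it in the scrapy shell as response.xpath(target).extract() returns []. Trying to target any individual cell gives the same empty result. My expected result would be a data frame or dictionary along the lines of {'Dwelling Units': 1, 'Year Built': 2010 ... }. Any help figuring out where I am going wrong, or how to get the data in that format, would be appreciated. Thanks!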

Answer 1

import scrapy


class ResidentialRecordsSpider(scrapy.Spider):
    name = "residential_records"

    start_urls = [
        'http://land.elpasoco.com/ResidentialBuilding.aspx?schd=5317443025&bldg=1',
    ]

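    def parse(self, response):
        # Match the cells by the table's width attribute instead of the long
        # absolute path copied from the browser.
        for record in response.xpath('//table[@width="90%"]//td'):
            # The label is the text of the <strong> child (empty if absent).
            key = record.xpath('./strong/text()').extract_first(default='')
            # The value is the first text node directly inside the cell.
            value = record.xpath('./text()').extract_first(default='')

            yield {key: value}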
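
As for why the XPath copied from the browser returns []: a common cause is that browsers add tbody elements to tables when they build the DOM, while the raw HTML that scrapy downloads often has none, so any path containing /tbody/ matches nothing. This has not been checked against the page itself, so treat it as a likely explanation only. Below is a minimal sketch of the effect. It uses the standard library's xml.etree.ElementTree as a stand-in for response.xpath, and RAW_HTML is a made-up, simplified snippet (an assumption, not the real page source):

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the raw HTML sent by the server.
# Note that the table has no <tbody> element.
RAW_HTML = """
<html><body>
<form id="aspnetForm">
<table width="90%">
<tr><td><strong>Dwelling Units</strong> 1</td></tr>
<tr><td><strong>Year Built</strong> 2010</td></tr>
</table>
</form>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Path in the style the browser inspector produces, including tbody.
with_tbody = root.findall(".//form[@id='aspnetForm']/table/tbody/tr")

# The same path with the tbody step dropped.
without_tbody = root.findall(".//form[@id='aspnetForm']/table/tr")

print(len(with_tbody))     # number of rows matched with tbody in the path
print(len(without_tbody))  # number of rows matched without it
```

In the scrapy shell, the equivalent check would be to delete the tbody steps from the copied expression, or to use a shorter relative expression like the one in the spider above.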

With the spider above, you only need to do some data cleaning on the yielded items.
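
A minimal sketch of what that cleaning might look like, assuming the spider yields one-pair dicts with stray whitespace and some empty keys. The helper name clean_records and the sample items are hypothetical, not output taken from the real page:

```python
def clean_records(items):
    """Merge one-pair dicts into a single dict with tidy keys and values."""
    cleaned = {}
    for item in items:
        for key, value in item.items():
            # Trim whitespace and a trailing colon from the label.
            key = key.strip().rstrip(':').strip()
            value = value.strip()
            # Cells without a <strong> label produce an empty key; skip them.
            if not key:
                continue
            # Convert purely numeric values to int, keep everything else as text.
            cleaned[key] = int(value) if value.isdigit() else value
    return cleaned


# Hypothetical raw items, shaped like what the spider might yield.
raw_items = [
    {'': ''},
    {'Dwelling Units: ': ' 1 '},
    {'Year Built:': '\r\n 2010'},
    {'Example Field': ' some text '},
]

result = clean_records(raw_items)
print(result)
```

The result is a single dictionary in the shape the question asks for, which can be passed to a data frame constructor if a table is preferred.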

Extracting data from an HTML table using scrapy: response.xpath() yields nothing