Question

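I'm working on a project that involves scraping data from a website with Scrapy. We previously used Selenium, but now we have to use Scrapy. I have no experience with Scrapy, but I can start learning it now. One of the challenges is scraping data that is organized as a table. There is a link for downloading this data, but it does not work for me.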

Below is the structure of the table: [image: html structure]

All of my data is under tbody, with one tr per record.

The pseudocode I have written so far is:

def parse_products(self, response):
    rows=response.xpath('//*[@id="records_table"]/tbody/')
    for i in rows:
      item = table_item()
      item['company'] = i.xpath('td[1]//text()').extract_first()
      item['naic'] = i.xpath('td[2]//text()').extract_first()
      yield item

Am I accessing the table body correctly with this XPath? I'm not sure the XPath I specified is correct.

Answer 1

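The XPath in the question ends with a trailing slash, which is not a valid expression, and it stops at tbody instead of selecting the tr rows. A better way to write it: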

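def parse_products(self, response):
    # Iterate over every row (tr) of the table with id="records_table"
    for row in response.css('table#records_table tr'):
        item = table_item()  # the Item class from the question
        # td[1] and td[2] are the first and second cells of the current row
        item['company'] = row.xpath('.//td[1]/text()').get()
        item['naic'] = row.xpath('.//td[2]/text()').get()
        yield item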

Here you iterate over the rows of the table and then extract the data from each cell.

Collecting table data with Scrapy (Python)
