Question

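I'm new to Python (and to Stack Overflow) and have just started using Scrapy. I'd like to collect some hobby product information from different websites. I've read through the tutorial and it was very helpful. What I want are the watch attributes listed in the tables, but in the second table the label cells all share the same class ("productTitle").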

<table border="0" cellspacing="0" cellpadding="4">
  <tbody>
    <tr>
      <td class="productTitle creditCardPrice" valign="top">
        <strong>Regular Price:</strong>
      </td> 
      <td valign="top">$9,072</td>
    </tr>
    <tr>
      <td class="productTitle retailPrice" valign="top">
        <strong>Retail Price:</strong>
      </td> 
      <td valign="top">$12,350</td> 
    </tr>
    <tr>
      <td class="productTitle itemNumber" valign="top">
        <strong>Item Number:</strong>
      </td> 
      <td valign="top">112555</td> 
    </tr>
  </tbody>
</table>

The second table:

<table border="0" cellpadding="4" cellspacing="0">
  <tbody>
    <tr style="height: 15px;">
      <td class="productTitle" style="height: 15px;" valign="top"> .     
        <strong>Manufacturer:</strong>
      </td> 
      <td style="height: 15px;" valign="top">Rolex</td> 
    </tr>
    <tr style="height: 30px;">
      <td class="productTitle" style="height: 30px;" valign="top">
        <strong>Model Name/Number:</strong>
      </td> 
      <td style="height: 30px;" valign="top">Yacht-Master 116622</td> 
    </tr>

There are more rows of data. You can see an example here: https://www.bobswatches.com/rolex-platinum-yacht-master-116622-pre-owned.html

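My goal is to put all of this data into a .csv file, with columns labeled "Credit Card Price", "Manufacturer", "Model Name/Number", and so on, then scrape my favorite watches from the site and build a table with all of these details for each watch. But before I get to the part where the spider crawls through different pages, I have to get it to scrape this one page correctly.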

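I can't figure out how to write this with Scrapy. I've been going back and forth between several other Stack Overflow questions and am still working through the tutorial, but progress is very slow. This is obviously wrong, but here is where I am: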

    def parse(self, response):
        for row in response.selector.xpath('//table'):
            yield {
                'text': row.xpath('./td[1]').extract_first(),
            }

        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

Answer 1

If I understand correctly, you want to extract structured data: the row titles and row values from these tables?

You can do that in two steps:

1. Extract all the table rows.
2. For each row, extract the title and the row data.

So it's just a matter of using the right XPath selectors. For example, something like this would do the trick:

# find all table rows
rows = response.xpath("//tr")
for row in rows:
    title = row.xpath(".//strong/text()").extract_first()
    text = ''.join(row.xpath(".//td/text()").extract()).strip('. \n')
    print(title)
    print(text)
    print('-'*80)

This returns:

Regular Price:
$9,072
--------------------------------------------------------------------------------
Retail Price:
$12,350
--------------------------------------------------------------------------------
Item Number:
112555
--------------------------------------------------------------------------------
Regular Price:
$9,072
--------------------------------------------------------------------------------
Retail Price:
$12,350
--------------------------------------------------------------------------------
Item Number:
112555
--------------------------------------------------------------------------------
Manufacturer:
Rolex
--------------------------------------------------------------------------------
Model Name/Number:
Yacht-Master 116622
--------------------------------------------------------------------------------
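
To get from these printed pairs to the one-row-per-watch CSV described in the question, one option is to collect each page's rows into a dict keyed by the title. Below is a minimal sketch using only the standard library. `rows_to_item` and `items_to_csv` are hypothetical helper names, the pairs are copied from the output above, and keeping the first value for a repeated title is an assumption about how you would want the duplicate rows handled.

```python
import csv
import io


def rows_to_item(pairs):
    """Collapse (title, value) pairs into one dict, keeping the first value per title."""
    item = {}
    for title, value in pairs:
        if not title:
            continue  # skip rows that have no <strong> label
        key = title.strip().rstrip(':')  # 'Regular Price:' -> 'Regular Price'
        item.setdefault(key, value.strip())  # assumption: first occurrence wins
    return item


def items_to_csv(items):
    """Render items as CSV text, using the union of their keys as the header."""
    fieldnames = []
    for item in items:
        for key in item:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


# The (title, text) pairs printed above, duplicates included.
pairs = [
    ('Regular Price:', '$9,072'),
    ('Retail Price:', '$12,350'),
    ('Item Number:', '112555'),
    ('Regular Price:', '$9,072'),
    ('Retail Price:', '$12,350'),
    ('Item Number:', '112555'),
    ('Manufacturer:', 'Rolex'),
    ('Model Name/Number:', 'Yacht-Master 116622'),
]
item = rows_to_item(pairs)
print(items_to_csv([item]))
```

Inside a spider you would build the pairs from the `title` and `text` values in the loop and `yield` the resulting dict from `parse`. Scrapy's feed export (for example `scrapy crawl <spider> -o watches.csv`, where the file name is just a placeholder) can then write the CSV for you, so the `csv` part is only needed if you process the data outside Scrapy.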

First time using Scrapy, trying to scrape a set of tables
