Extracting multiple pieces of data from table td elements with Scrapy

Asked: 2016-05-25 06:21:21

Tags: python scrapy

I'm new to Scrapy and am trying to use it to extract the "name", "address", "state", and "postal_code" from the following sample HTML:

<div id="superheroes">
<table width="100%" border="0">
  <tr>
  <td valign="top">
  <h2>Superheroes in New York</h2>
  <hr/>
  </td>
  </tr>
  <tr valign="top">
    <td width="75%">                    
      <h2>Peter Parker</h2>
      <hr />
      <table width="100%">
        <tr valign="top">
          <td width="13%" height="70" valign="top"><img src="/img/spidey.jpg"/></td>
          <td width="87%" valign="top"><strong>Address:</strong> New York City<br/>
            <strong>State:</strong>New York<br/>
            <strong>Postal Code:</strong>12345<br/>
            <strong>Telephone:</strong> 555-123-4567</td>
        </tr>
        <tr>
          <td height="18" valign="top">&nbsp;</td>
          <td align="right" valign="top"><a href="spiderman"><strong>Read More</strong></a></td>
        </tr>
      </table>
      <h2>Tony Stark</h2>
      <hr />
      <table width="100%" border="0" cellpadding="2" cellspacing="2" valign="top">
        <tr valign="top">
          <td width="13%" height="70" valign="top"><img src="/img/ironman.jpg"/></td>
          <td width="87%" valign="top"><strong>Address:</strong> New York City<br/>
            <strong>State:</strong> New York<br/>
            <strong>Postal Code:</strong> 54321<br/>
            <strong>Telephone:</strong> 555-987-6543</td>
        </tr>
        <tr>
          <td height="18" valign="top">&nbsp;</td>
          <td align="right" valign="top"><a href="iron_man"><strong>Read More</strong></a></td>
        </tr>
      </table>
    </td>
    <td width="25%">
       <script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
    </td>
  </tr>
</table>
</div>

My superheroes.py contains the following code:

from scrapy.spider import CrawlSpider, Rule
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from superheroes.items import Superheroes

items = []

class MySpider(CrawlSpider):
  name = "superheroes"
  allowed_domains = ["www.somedomain.com"]
  start_urls = ["http://www.somedomain.com/ny"]
  rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

   def parse_item(self, response):
     sel = Selector(response)
     tables = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]')
     for table in tables:
        item = Superheroes()
        item['name'] = table.xpath('h2/text()').extract()
        item['address'] = table.xpath('/tr[1]/td[2]/strong[1]/text()').extract()
        item['state'] = table.xpath('/tr[1]/td[2]/strong[2]/text()').extract()
        item['postal_code'] = table.xpath('/tr[1]/td[2]/strong[3]/text()').extract()
        items.append(item)
     return items

My items.py contains:

import scrapy
class Superheroes(scrapy.Item):
    name = scrapy.Field()
    address = scrapy.Field()
    state = scrapy.Field()
    postal_code = scrapy.Field()    

When I run "scrapy runspider superheroes.py -o super_db -t csv", the output file is empty.

Can anyone help me find the errors in the code above?

Thanks very much for your help!

2 Answers:

Answer 0 (score: 1)

You should change the XPath expressions inside your for loop and yield each item, instead of returning an array.
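To illustrate the yield-instead-of-return pattern that answer describes, here is a minimal stand-alone sketch (plain Python, with a hypothetical parse_item that takes a list of names rather than a Scrapy response):

```python
# Hypothetical stand-in for a Scrapy callback: yielding inside the loop
# turns parse_item into a generator, so each item is emitted as it is
# built and no module-level items list is needed.
def parse_item(rows):
    for row in rows:
        item = {"name": row}
        yield item  # one item at a time, which is what Scrapy expects

collected = list(parse_item(["Peter Parker", "Tony Stark"]))
print(collected)
```

Scrapy consumes whatever the callback yields, so the shared global list in the question (which persists across calls and can duplicate items) is unnecessary.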

Answer 1 (score: 1)

There are two problems with your code. First, your parse_item method doesn't appear to be indented (at least, that's how it looks in your question), so it isn't part of the MySpider class. Every line in superheroes.py starting from "def parse_item(self, response):" needs two more spaces in front of it.

The second problem is that the rules declaration causes parse_item to be called for each link found in the page (by the SgmlLinkExtractor). You can see in your output that it tries to retrieve /iron_man and /spiderman - it is the output of those pages that gets passed to parse_item.

To process the start_urls page with parse_item, you need to rename it parse_start_url. And if that is the only page you are processing, you can get rid of rules entirely! (See the documentation on parse_start_url.)

Your updated class would look like this (note that I have also moved items inside the method; there is no need to declare it as a global):

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector

from superheroes.items import Superheroes

class MySpider(CrawlSpider):
  name = "superheroes"
  allowed_domains = ["localhost"]
  start_urls = ["http://localhost:8000/page.html"]

  # indentation!
  def parse_start_url(self, response):
    sel = Selector(response)
    headers = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]/h2')
    for header in headers:
      item = Superheroes()

      item['name'] = header.xpath('text()')[0].extract()

      table = header.xpath('following-sibling::table')
      item['address'] = table.xpath('tr[1]/td[2]/strong[1]/following-sibling::text()')[0].extract().strip()
      item['state'] = table.xpath('tr[1]/td[2]/strong[2]/following-sibling::text()')[0].extract().strip()
      item['postal_code'] = table.xpath('tr[1]/td[2]/strong[3]/following-sibling::text()')[0].extract().strip()

      yield item

Edit: thanks to @Daniil Mashkin for pointing out that the original XPath expressions didn't work. I have corrected them in the code above. Cheers!
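The corrected expressions rely on the following-sibling XPath axis to pair each h2 header with the table that comes immediately after it. Scrapy's selectors (backed by lxml) support that axis directly; as a rough standard-library illustration of the same pairing, applied to a simplified, well-formed version of the question's markup:

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed fragment modeled on the <td> in the question:
# each <h2> name is followed by a sibling <table> holding the details.
fragment = """<td>
  <h2>Peter Parker</h2>
  <table><tr><td>New York City</td></tr></table>
  <h2>Tony Stark</h2>
  <table><tr><td>New York City</td></tr></table>
</td>"""

root = ET.fromstring(fragment)
pairs = []
name = None
for child in root:
    if child.tag == "h2":
        name = child.text  # remember the most recent header
    elif child.tag == "table" and name is not None:
        first_cell = child.find("./tr/td")
        pairs.append((name, first_cell.text))  # header + its sibling table
        name = None
print(pairs)
```

This is the same idea as iterating over the h2 nodes and calling header.xpath('following-sibling::table') in the answer's code, just expressed as a manual walk because ElementTree does not implement the full XPath axes.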