Question

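I am trying to scrape items from a page that contains various HTML elements and a series of nested tables.

I have some code that successfully scrapes from table X where class="ClassA" and outputs the table elements into a series of items, such as company address, phone number, website address, etc.

I would like to add some extra items to the list I am outputting, but the other items to be scraped are not located in the same table, and some are not in a table at all, e.g. an <h1> tag in another part of the page.

How can I add other items to my output using an XPath filter and have them appear in the same array/output structure? I noticed that if I scrape extra table items from another table (even when the table has exactly the same class name and ID), the CSV output for those items lands on a different row, so the CSV structure is not kept intact :(

I'm sure there must be a way to keep items unified in the CSV output even if they are scraped from slightly different areas of a page? Hopefully it's just a simple fix...

----- EXAMPLE HTML PAGE BEING SCRAPED -----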

<html>
<head></head>
<body>

<!-- huge amount of other HTML and tables NOT to be scraped -->

<h2>HEADING TO BE SCRAPED - Company Name</h2>
<p>Company Description</p>

<table cellspacing="0" class="contenttable company-details">
<tr>
  <th>Item Code</th>
  <td>IT123</td>
</tr>
<tr>
  <th>Listing Date</th>
  <td>12 September, 2011</td>
</tr>
<tr>
  <th>Internet Address</th>
  <td class="altrow"><a href="http://www.website.com/" target="_top">http://www.website.com/</a></td>
</tr>
<tr>
  <th>Office Address</th>
  <td>123 Example Street</td>
</tr>    
<tr>
  <th>Office Telephone</th>
  <td>(01) 1234 5678</td>
</tr>       
</table>

<table cellspacing="0" class="contenttable" id="staff">
<tr><th>Management Names</th></tr>
<tr>
    <td>
    Mr John Citizen (CEO)<br/>Mrs Mary Doe (Director)<br/>Dr J. Watson (Manager)<br/>
    </td>
</tr>
</table>

<table cellspacing="0" class="contenttable company-details">    
<tr>
    <th>Contact Person</th>
    <td>        
    Mr John Citizen<br/>        
    </td>
</tr>   
<tr>
    <th class=principal>Company Mission</th>
    <td>ACME Corp is a retail sales company.</td>
</tr>   
</table>

</body>
</html>

---- SCRAPY CODE EXAMPLE ----

from scrapy.spider import Spider
from scrapy.selector import Selector
from my.items import MyItem  # must be the same item class that parse() instantiates

class MySpider(Spider):
    name = "my"
    allowed_domains = ["website.com"]
    start_urls = ["http://www.website.com/ABC"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//table[@class="contenttable company-details"]')
        items = []

        # One item is created and yielded per matched table
        for site in sites:
            item = MyItem()
            item['Company_name'] = site.xpath('.//h1//text()').extract()
            item['Item_Code'] = site.xpath('.//th[text()="Item Code"]/following-sibling::td//text()').extract()
            item['Listing_Date'] = site.xpath('.//th[text()="Listing Date"]/following-sibling::td//text()').extract()
            item['Website_URL'] = site.xpath('.//th[text()="Internet Address"]/following-sibling::td//text()').extract()
            item['Office_Address'] = site.xpath('.//th[text()="Office Address"]/following-sibling::td//text()').extract()
            item['Office_Phone'] = site.xpath('.//th[text()="Office Telephone"]/following-sibling::td//text()').extract()
            item['Company_Mission'] = site.xpath('//th[text()="Company Mission"]/following-sibling::td//text()').extract()
            yield item

Output to CSV

scrapy crawl my -o items.csv -t csv

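Using the example code above, the [Company Mission] item appears on a different row in the CSV from the other items (guessing because it's in a different table), even though it has the same class name and ID. In addition, I'm unsure how to scrape the <h1> field, since it falls outside the table structure of my current XPath sites filter.

I could expand the sites XPath filter to include more content, but won't that be less efficient and defeat the point of filtering altogether?

Here is an example of the debug log, where you can see that Company Mission is being processed twice for some reason, and the first pass is empty, which must be why it ends up on a new row in the CSV. But why?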

{'Item_Code': [u'ABC'],
 'Listing_Date': [u'1 January, 2000'],
 'Office_Address': [u'Level 1, Some Street, SYDNEY, NSW, AUSTRALIA, 2000'],
 'Office_Fax': [u'(02) 1234 5678'],
 'Office_Phone': [u'(02) 1234 5678'],
 'Company_Mission': [],
 'Website_URL': [u'http://www.company.com']}
2014-02-06 16:32:13+1000 [my] DEBUG: Scraped from <200 http://www.website.com/Code=ABC>
{'Item_Code': [],
 'Listing_Date': [],
 'Office_Address': [],
 'Office_Fax': [],
 'Office_Phone': [],
 'Company_Mission': [u'The comapany is involved in retail, food and beverage, wholesale services.'],
 'Website_URL': []}

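Another thing that completely baffles me is why the items come out in the CSV in a completely different order from the items on the HTML page and from the order I defined in the spider's config file. Does Scrapy run completely asynchronously and return items in whatever order it pleases?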

Answer 1

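"guessing because its in a different table" - wrong guess. There is no correlation between tables and items; in fact, it doesn't matter where the data comes from, as long as you set the item fields.

This means you can take Company_name and Company_Mission from anywhere.

Having said that, check what //th[text()="Company Mission"] returns and how many times it shows up on the page. While the other items' XPaths are relative (starting with .), this one is absolute (starting with //), so it may be scraping a list of items rather than just one.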

Answer 2

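I understand that you want to scrape 1 item for this page, but //table[@class="contenttable company-details"] matches 2 table elements in your HTML content, so for site in sites: will run twice, creating 2 items.

For each table, XPath expressions are applied within the current table if they are relative (e.g. .//th[text()="Item Code"]). Absolute XPath expressions, such as //th[text()="Company Mission"], look for elements starting from the root element of the HTML document.

Your example output shows "Company_Mission" only once, while you say it appears twice. And because you are using an absolute XPath expression, it should indeed have appeared twice. I'm not sure the output matches the current spider code in the question.

So, on the first iteration of the loop,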

    <table cellspacing="0" class="contenttable company-details">
    <tr>
      <th>Item Code</th>
      <td>IT123</td>
    </tr>
    <tr>
      <th>Listing Date</th>
      <td>12 September, 2011</td>
    </tr>
    <tr>
      <th>Internet Address</th>
      <td class="altrow"><a href="http://www.website.com/" target="_top">http://www.website.com/</a></td>
    </tr>
    <tr>
      <th>Office Address</th>
      <td>123 Example Street</td>
    </tr>    
    <tr>
      <th>Office Telephone</th>
      <td>(01) 1234 5678</td>
    </tr>       
    </table>

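you can scrape:

Item Code
Listing Date
Internet Address -> Website URL
Office Address
Office Telephone

And because you are using an absolute XPath expression, //th[text()="Company Mission"]/following-sibling::td//text() will look anywhere in the document, not only inside the first <table cellspacing="0" class="contenttable company-details">.

These extracted fields go into an item of their own.

Then comes the 2nd table that matches your XPath for sites: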

    <table cellspacing="0" class="contenttable company-details">    
    <tr>
        <th>Contact Person</th>
        <td>        
        Mr John Citizen<br/>        
        </td>
    </tr>   
    <tr>
        <th class=principal>Company Mission</th>
        <td>ACME Corp is a retail sales company.</td>
    </tr>   
    </table>

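A new MyItem() is instantiated for it, and no XPath expression matches here except the absolute XPath for "Company Mission", so at the end of this loop iteration you have an item containing only "Company Mission".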

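If you're sure you need 1 and only 1 item from this page, you can use a longer XPath for each field you want, such as //table[@class="contenttable company-details"]//th[text()="Item Code"]/following-sibling::td//text(), so that it matches in either the 1st or the 2nd table, and use only 1 MyItem() instance.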

Also, you can try CSS selectors, which are shorter to read and write, and easier to maintain:

"Company_name" <- sel.css('h2::text')
"Item_Code" <- sel.css('table.company-details th:contains("Item Code") + td::text')
"Listing_Date" <- sel.css('table.company-details th:contains("Listing Date") + td::text')
etc.

Note that :contains() is available in Scrapy through the underlying cssselect library, but it is not standard (it was removed from the CSS specs, though it is handy). The ::text pseudo-element selector is also non-standard; it is a Scrapy extension, and also handy.

Scrapy: using XPath to output multiple item elements to a single CSV?
