Question

Following the help documentation, I wrote the scraper below:

import scrapy

from lankatable.items import LankatableItem

class TableScraper(scrapy.Spider):
    """docstring for TableScraper."""
    name = "table"
    allowed_domains = ["lankabd.com"]
    start_urls = [
        "http://lankabd.com/dse/stock-market/GSPFINANCE/gsp-finance-company-(bangladesh)-limited-/financial-statements?companyId=300&stockId=287",
    ]

    def parse(self,response):
        Item = LankatableItem()
        Item['industry'] = response.css('.portalTitleL2 ::text').extract_first().split(' - ')[-2]
        Item['ticker']   = response.css('.portalTitle.companyTitle ::text').extract_first().split(' (')[-1].strip(')')
        Item['yearEnd']  = response.css('.note>font::text').extract_first()
        # text in a row-cell
        Item['summery'] = {}
        for tr in response.xpath(".//*[@id='summery']/table/tbody/tr"):
            Item['summery']['title'] = tr.xpath('/td[1]/text()').extract_first().strip()
            Item['summery']['y2011'] = tr.xpath('/td[2]/span/text()').extract_first().strip()
            print Item
        print "Hello World!"

The item is defined as follows:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class LankatableItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ticker   = scrapy.Field()
    industry = scrapy.Field()
    yearEnd  = scrapy.Field()
    summery  = scrapy.Field()   # should hold 'summery' table from the page
    balance  = scrapy.Field()   # should hold 'Balance-sheet' table from the page
    income   = scrapy.Field()   # should hold 'income-statemnt' table from the page
    cash     = scrapy.Field()   # should hold 'cash-flow' table from the page

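But it doesn't scrape anything, and I can't figure out what is missing in my code. Any help is much appreciated. Since response supports XPath directly, I did not use HtmlXPathSelector in my code.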

I run it with scrapy crawl table from the project's root folder.

Answer 1

Your XPath doesn't work because of tbody. Remove it from the expression and check whether you get the result you want.

You can read about this in the Scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html

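"Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions."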
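To illustrate the point, here is a minimal sketch of the effect. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy so that it runs without extra dependencies, and the markup in RAW_HTML is an invented sample, not the real lankabd.com page. In the spider itself the equivalent change would presumably be response.xpath(".//*[@id='summery']/table/tr"). Note also that the row-level expressions in the question start with "/", which XPath treats as absolute, so a relative form such as tr.xpath('./td[1]/text()') is likely needed as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw page source. As served by the site,
# the table has no <tbody>; browsers add that element when rendering.
RAW_HTML = """
<html><body>
  <div id="summery">
    <table>
      <tr><td>Revenue</td><td><span>100</span></td></tr>
      <tr><td>Net profit</td><td><span>25</span></td></tr>
    </table>
  </div>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path copied from the browser inspector: includes tbody, matches nothing.
with_tbody = root.findall(".//*[@id='summery']/table/tbody/tr")

# Same path with tbody removed: matches the rows in the raw source.
without_tbody = root.findall(".//*[@id='summery']/table/tr")

rows = []
for tr in without_tbody:
    # Relative paths (starting with "./") are evaluated against the current row.
    title = tr.find("./td[1]").text.strip()
    value = tr.find("./td[2]/span").text.strip()
    rows.append({"title": title, "y2011": value})

print(len(with_tbody))
print(rows)
```

Collecting one dictionary per row in a list, rather than overwriting the same keys on every iteration as the question's loop does, keeps all rows available once the loop finishes.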
