Following the help I received, I designed the scraper as follows:
import scrapy
from lankatable.items import LankatableItem


class TableScraper(scrapy.Spider):
    """docstring for TableScraper."""
    name = "table"
    allowed_domains = ["lankabd.com"]
    start_urls = [
        "http://lankabd.com/dse/stock-market/GSPFINANCE/gsp-finance-company-(bangladesh)-limited-/financial-statements?companyId=300&stockId=287",
    ]

    def parse(self,response):
        Item = LankatableItem()
        Item['industry'] = response.css('.portalTitleL2 ::text').extract_first().split(' - ')[-2]
        Item['ticker'] = response.css('.portalTitle.companyTitle ::text').extract_first().split(' (')[-1].strip(')')
        Item['yearEnd'] = response.css('.note>font::text').extract_first()
        # text in a row-cell
        Item['summery'] = {}
        for tr in response.xpath(".//*[@id='summery']/table/tbody/tr"):
            Item['summery']['title'] = tr.xpath('/td[1]/text()').extract_first().strip()
            Item['summery']['y2011'] = tr.xpath('/td[2]/span/text()').extract_first().strip()
        print Item
        print "Hello World!"
The item is defined as:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class LankatableItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ticker = scrapy.Field()
    industry = scrapy.Field()
    yearEnd = scrapy.Field()
    summery = scrapy.Field()  # should hold 'summery' table from the page
    balance = scrapy.Field()  # should hold 'Balance-sheet' table from the page
    income = scrapy.Field()   # should hold 'income-statemnt' table from the page
    cash = scrapy.Field()     # should hold 'cash-flow' table from the page
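But it doesn't scrape anything, and I can't figure out what is missing in my code. Any help is much appreciated. Since response supports XPath internally, I did not use HtmlXPathSelector in my code.
I run it with scrapy crawl table from the project root folder.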
Answer 0 (score: 1)
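The reason your XPath doesn't work is tbody. Remove it from the expression and check whether you then get the result you want.
You can read about this in the Scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html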
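Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions.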
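To see the effect without hitting the site, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the HTML snippet, row labels, and the rows_to_dicts helper are all made up for illustration; the real page will differ. It is written for Python 3, unlike the spider above. It also uses paths relative to each row (td[1] rather than /td[1]), since a leading slash makes an XPath start from the document root instead of the current tr.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw page source: the table has no <tbody>,
# which is what a server may send even when the browser inspector shows one.
RAW_HTML = """
<html><body>
  <div id="summery">
    <table>
      <tr><td>Row A</td><td><span>1.0</span></td></tr>
      <tr><td>Row B</td><td><span>2.0</span></td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Path as copied from browser tooling: includes tbody, so nothing matches here.
with_tbody = root.findall(".//*[@id='summery']/table/tbody/tr")

# Same path with tbody removed: matches the rows that are really in the source.
without_tbody = root.findall(".//*[@id='summery']/table/tr")


def rows_to_dicts(rows):
    """Collect one dict per row, using paths relative to each <tr>."""
    result = []
    for tr in rows:
        title = tr.find("td[1]").text.strip()
        value = tr.find("td[2]/span").text.strip()
        result.append({"title": title, "y2011": value})
    return result


summary = rows_to_dicts(without_tbody)
print(len(with_tbody), len(without_tbody))
print(summary)
```

Collecting one dict per row into a list is only one possible design; it avoids overwriting the same 'title' and 'y2011' keys on every loop iteration, which is what happens when a single dict is reused for all rows.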