Scraping information from scraped URLs

Date: 2016-02-12 09:22:22

Tags: python scrapy

I am new to Scrapy and am currently learning how to scrape information from a list of scraped URLs. I have been able to scrape information from a single URL by following the tutorial on the Scrapy website. However, even after searching online for a solution, I have not been able to scrape information from a list of URLs scraped from that first page.

The scraper I wrote below can scrape from the first URL. However, scraping from the list of scraped URLs fails. The problem starts at `def parse_following_urls(self, response):` — I cannot scrape anything from the scraped list of URLs.

Can anyone help solve this problem? Thanks in advance.

import scrapy
from scrapy.http import Request

class SET(scrapy.Item):
    title = scrapy.Field()
    open = scrapy.Field()
    hi = scrapy.Field()
    lo = scrapy.Field()
    last = scrapy.Field()
    bid = scrapy.Field()
    ask = scrapy.Field()
    vol = scrapy.Field()
    exp = scrapy.Field()
    exrat = scrapy.Field()
    exdat = scrapy.Field()

class ThaiSpider(scrapy.Spider):
    name = "warrant"
    allowed_domains = ["marketdata.set.or.th"]
    start_urls = ["http://marketdata.set.or.th/mkt/stocklistbytype.do?market=SET&language=en&country=US&type=W"]

    def parse(self, response):
        for sel in response.xpath('//table[@class]/tbody/tr'):
            item = SET()
            item['title'] = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/text()').extract()
            item['open'] = sel.xpath('td[3]/text()').extract()
            item['hi'] = sel.xpath('td[4]/text()').extract()
            item['lo'] = sel.xpath('td[5]/text()').extract()
            item['last'] = sel.xpath('td[6]/text()').extract()
            item['bid'] = sel.xpath('td[9]/text()').extract()
            item['ask'] = sel.xpath('td[10]/text()').extract()
            item['vol'] = sel.xpath('td[11]/text()').extract()
            yield item
        urll = response.xpath('//table[@class]/tbody/tr/td[1]/a[contains(@href,"ssoPageId")]/@href').extract()
        urls = ["http://marketdata.set.or.th/mkt/"+ i for i in urll]
        for url in urls:
            request = scrapy.Request(url, callback=self.parse_following_urls, dont_filter=True)
            yield request
        request.meta['item'] = item

    def parse_following_urls(self, response):
        for sel in response.xpath('//table[3]/tbody'):
            item = response.meta['item']
            item['exp'] = sel.xpath('tr[1]/td[2]/text()').extract()
            item['exrat'] = sel.xpath('tr[2]/td[2]/text()').extract()
            item['exdat'] = sel.xpath('tr[3]/td[2]/text()').extract()
            yield item

After trying the suggestions and looking at the output, I rewrote the code. Below is the edited version. However, I now get another error stating `Request url must be str or unicode, got list`. How do I convert the URL from a list to a string?

I thought the URL had to be a string already, since it is built inside the for loop. I have marked the failing line with a comment in the code below. Is there any way to solve this problem?
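The error can be reproduced outside Scrapy: `extract()` always returns a *list* of strings, and prepending a one-element list with `+` still yields a list, not a `str`. A minimal illustration of the problem and the fix (the href value here is a made-up stand-in for what `extract()` returns):

```python
# extract() returns a list of strings; list + list is still a list,
# which is why scrapy.Request rejects it as a URL.
base = ["http://marketdata.set.or.th/mkt/"]
hrefs = ["stockquotation.do?symbol=ABC-W1"]  # stand-in for extract()'s result

url = base + hrefs
print(type(url).__name__)  # list -> triggers the "must be str" error

# Fix: take the first extracted string and use string concatenation.
url = "http://marketdata.set.or.th/mkt/" + hrefs[0]
print(type(url).__name__)  # str -> acceptable to scrapy.Request
print(url)
```

In the spider itself, `sel.xpath(...).extract_first()` (or `.extract()[0]`) gives the same single-string result directly.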

import scrapy
from scrapy.http import Request

class SET(scrapy.Item):
    title = scrapy.Field()
    open = scrapy.Field()
    hi = scrapy.Field()
    lo = scrapy.Field()
    last = scrapy.Field()
    bid = scrapy.Field()
    ask = scrapy.Field()
    vol = scrapy.Field()
    exp = scrapy.Field()
    exrat = scrapy.Field()
    exdat = scrapy.Field()

class ThaiSpider(scrapy.Spider):
    name = "warrant"
    allowed_domains = ["marketdata.set.or.th"]
    start_urls = ["http://marketdata.set.or.th/mkt/stocklistbytype.do?market=SET&language=en&country=US&type=W"]

    def parse(self, response):
        for sel in response.xpath('//table[@class]/tbody/tr'):
            item = SET()
            item['title'] = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/text()').extract()
            item['open'] = sel.xpath('td[3]/text()').extract()
            item['hi'] = sel.xpath('td[4]/text()').extract()
            item['lo'] = sel.xpath('td[5]/text()').extract()
            item['last'] = sel.xpath('td[6]/text()').extract()
            item['bid'] = sel.xpath('td[9]/text()').extract()
            item['ask'] = sel.xpath('td[10]/text()').extract()
            item['vol'] = sel.xpath('td[11]/text()').extract()
            url = ["http://marketdata.set.or.th/mkt/"]+ sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/@href').extract()
            request = scrapy.Request(url, callback=self.parse_following_urls, dont_filter=True) #Request url must be str or unicode, got list: How to solve this?
            request.meta['item'] = item
            yield item
            yield request

    def parse_following_urls(self, response):
        for sel in response.xpath('//table[3]/tbody'):
            item = response.meta['item']
            item['exp'] = sel.xpath('tr[1]/td[2]/text()').extract()
            item['exrat'] = sel.xpath('tr[2]/td[2]/text()').extract()
            item['exdat'] = sel.xpath('tr[3]/td[2]/text()').extract()
            yield item

2 Answers:

Answer 0 (score: 0)

I tried changing the fifth line from the end, from

item = response.meta['item']

to

item = SET()

and then it works! I actually had not understood your "meta" approach, since I had never used it to carry an item between callbacks.

Answer 1 (score: 0)

I see what you are trying to do here; it is called request chaining.

What this means is that you want to keep yielding `Request`s while carrying the partially filled `Item` along in each `Request`'s `meta` attribute.

For your case, all you need to do is, instead of yielding an `Item`, yield a `Request` with the item inside it. Change your parse to:
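The code block that followed was not captured in this copy. As a rough, framework-free sketch of the chaining pattern the answer describes — attach the partial item to the request's `meta` dict in the first callback and finish it in the second — where `FakeRequest` and all field names are illustrative stand-ins, not Scrapy's API:

```python
# Minimal stand-in for scrapy.Request: a url, a callback, and a meta dict.
class FakeRequest:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback
        self.meta = {}


def parse(rows):
    # First callback: fill part of the item, then chain a request for the rest.
    for title, href in rows:
        item = {"title": title}
        req = FakeRequest("http://marketdata.set.or.th/mkt/" + href, parse_detail)
        req.meta["item"] = item  # carry the partial item along
        yield req


def parse_detail(request, exp_value):
    # Second callback: finish the item carried over in meta and return it.
    item = request.meta["item"]
    item["exp"] = exp_value
    return item


# Drive the chain by hand (Scrapy's engine would do this for real requests).
rows = [("ABC-W1", "stockquotation.do?symbol=ABC-W1")]
for req in parse(rows):
    print(parse_detail(req, "2017-05-11"))
    # -> {'title': 'ABC-W1', 'exp': '2017-05-11'}
```

In real Scrapy code the same shape applies: build the `Request` with `callback=self.parse_following_urls`, set `request.meta['item'] = item` *inside* the row loop, and yield the finished item only from the second callback.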