I am new to Scrapy and am currently learning how to scrape information from a list of scraped URLs. I was able to scrape information from a single URL by following the tutorial on the Scrapy website. However, even after searching online for a solution, I am unable to scrape information from the list of URLs that I scraped.
The scraper I wrote below can scrape from the first URL, but scraping from the list of scraped URLs fails. The problem starts at def parse_following_urls(self, response): I cannot scrape anything from the list of scraped URLs.
Can anyone help with this problem? Thanks in advance.
import scrapy
from scrapy.http import Request


class SET(scrapy.Item):
    title = scrapy.Field()
    open = scrapy.Field()
    hi = scrapy.Field()
    lo = scrapy.Field()
    last = scrapy.Field()
    bid = scrapy.Field()
    ask = scrapy.Field()
    vol = scrapy.Field()
    exp = scrapy.Field()
    exrat = scrapy.Field()
    exdat = scrapy.Field()


class ThaiSpider(scrapy.Spider):
    name = "warrant"
    allowed_domains = ["marketdata.set.or.th"]
    start_urls = ["http://marketdata.set.or.th/mkt/stocklistbytype.do?market=SET&language=en&country=US&type=W"]

    def parse(self, response):
        for sel in response.xpath('//table[@class]/tbody/tr'):
            item = SET()
            item['title'] = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/text()').extract()
            item['open'] = sel.xpath('td[3]/text()').extract()
            item['hi'] = sel.xpath('td[4]/text()').extract()
            item['lo'] = sel.xpath('td[5]/text()').extract()
            item['last'] = sel.xpath('td[6]/text()').extract()
            item['bid'] = sel.xpath('td[9]/text()').extract()
            item['ask'] = sel.xpath('td[10]/text()').extract()
            item['vol'] = sel.xpath('td[11]/text()').extract()
            yield item
        urll = response.xpath('//table[@class]/tbody/tr/td[1]/a[contains(@href,"ssoPageId")]/@href').extract()
        urls = ["http://marketdata.set.or.th/mkt/" + i for i in urll]
        for url in urls:
            request = scrapy.Request(url, callback=self.parse_following_urls, dont_filter=True)
            yield request
            request.meta['item'] = item

    def parse_following_urls(self, response):
        for sel in response.xpath('//table[3]/tbody'):
            item = response.meta['item']
            item['exp'] = sel.xpath('tr[1]/td[2]/text()').extract()
            item['exrat'] = sel.xpath('tr[2]/td[2]/text()').extract()
            item['exdat'] = sel.xpath('tr[3]/td[2]/text()').extract()
            yield item
After trying the suggestions and looking at the output, I rewrote the code. The edited code is below. However, I now get another error saying Request url must be str or unicode, got list. How do I convert the URL from a list to a string?
I thought the URL would already be a string, since it is built inside the for loop. I have marked the spot with a comment in the code below. Is there any way to solve this?
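A minimal sketch of one possible fix (added here for reference, not part of the original post): use extract_first() instead of extract() so the href is a single string before it is concatenated with the base URL. This assumes the first matching href is the one wanted, and would replace the two marked lines inside parse() in the code below.

href = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/@href').extract_first()
if href:
    # extract_first() returns a single string (or None), not a list,
    # so the concatenation below produces a plain string URL.
    url = "http://marketdata.set.or.th/mkt/" + href
    request = scrapy.Request(url, callback=self.parse_following_urls, dont_filter=True)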
import scrapy
from scrapy.http import Request


class SET(scrapy.Item):
    title = scrapy.Field()
    open = scrapy.Field()
    hi = scrapy.Field()
    lo = scrapy.Field()
    last = scrapy.Field()
    bid = scrapy.Field()
    ask = scrapy.Field()
    vol = scrapy.Field()
    exp = scrapy.Field()
    exrat = scrapy.Field()
    exdat = scrapy.Field()


class ThaiSpider(scrapy.Spider):
    name = "warrant"
    allowed_domains = ["marketdata.set.or.th"]
    start_urls = ["http://marketdata.set.or.th/mkt/stocklistbytype.do?market=SET&language=en&country=US&type=W"]

    def parse(self, response):
        for sel in response.xpath('//table[@class]/tbody/tr'):
            item = SET()
            item['title'] = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/text()').extract()
            item['open'] = sel.xpath('td[3]/text()').extract()
            item['hi'] = sel.xpath('td[4]/text()').extract()
            item['lo'] = sel.xpath('td[5]/text()').extract()
            item['last'] = sel.xpath('td[6]/text()').extract()
            item['bid'] = sel.xpath('td[9]/text()').extract()
            item['ask'] = sel.xpath('td[10]/text()').extract()
            item['vol'] = sel.xpath('td[11]/text()').extract()
            url = ["http://marketdata.set.or.th/mkt/"] + sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/@href').extract()
            request = scrapy.Request(url, callback=self.parse_following_urls, dont_filter=True)  # Request url must be str or unicode, got list: how to solve this?
            request.meta['item'] = item
            yield item
            yield request

    def parse_following_urls(self, response):
        for sel in response.xpath('//table[3]/tbody'):
            item = response.meta['item']
            item['exp'] = sel.xpath('tr[1]/td[2]/text()').extract()
            item['exrat'] = sel.xpath('tr[2]/td[2]/text()').extract()
            item['exdat'] = sel.xpath('tr[3]/td[2]/text()').extract()
            yield item
Answer 0 (score: 0)
I tried changing the fifth line from the end,
item = response.meta['item']
to
item = SET()
and then it works! Actually, I had not realized your "meta" approach, because I have never used it to pass an item before.
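In context, the change described above would look roughly like this (a sketch based on the question's own code):

def parse_following_urls(self, response):
    for sel in response.xpath('//table[3]/tbody'):
        item = SET()  # build a fresh item instead of reading it from response.meta
        item['exp'] = sel.xpath('tr[1]/td[2]/text()').extract()
        item['exrat'] = sel.xpath('tr[2]/td[2]/text()').extract()
        item['exdat'] = sel.xpath('tr[3]/td[2]/text()').extract()
        yield item

Note that a fresh SET() carries none of the fields filled in parse, so this yields a separate, partial item for each detail page rather than one combined record per warrant.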
Answer 1 (score: 0)
I see what you are trying to do here; it is called request chaining.
It means you want to keep yielding Requests and keep carrying the partially filled Item along in each Request's meta attribute.
For your case, all you need to do is, instead of yielding the Item directly, yield a Request that carries the item. Change your parse to:
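The code block that followed in the original answer did not survive extraction. Below is a sketch of what the request-chaining version of parse might look like, based on the question's code and the description above; it is an assumption, not the answerer's exact code.

def parse(self, response):
    for sel in response.xpath('//table[@class]/tbody/tr'):
        item = SET()
        item['title'] = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/text()').extract()
        item['open'] = sel.xpath('td[3]/text()').extract()
        item['hi'] = sel.xpath('td[4]/text()').extract()
        item['lo'] = sel.xpath('td[5]/text()').extract()
        item['last'] = sel.xpath('td[6]/text()').extract()
        item['bid'] = sel.xpath('td[9]/text()').extract()
        item['ask'] = sel.xpath('td[10]/text()').extract()
        item['vol'] = sel.xpath('td[11]/text()').extract()
        href = sel.xpath('td[1]/a[contains(@href,"ssoPageId")]/@href').extract_first()
        if href:
            # Do not yield the item here; pass it along in meta and let the
            # second callback yield it once the extra fields are filled in.
            yield scrapy.Request("http://marketdata.set.or.th/mkt/" + href,
                                 callback=self.parse_following_urls,
                                 meta={'item': item},
                                 dont_filter=True)

def parse_following_urls(self, response):
    item = response.meta['item']
    sel = response.xpath('//table[3]/tbody')
    item['exp'] = sel.xpath('tr[1]/td[2]/text()').extract()
    item['exrat'] = sel.xpath('tr[2]/td[2]/text()').extract()
    item['exdat'] = sel.xpath('tr[3]/td[2]/text()').extract()
    yield item  # one fully populated item per warrant comes out of the chained request

With this chaining, each item only reaches the pipelines after its detail page has been parsed, so every yielded item contains both the listing fields and the exercise fields.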