I have a problem with Scrapy. I'm scraping a subpage reached through a link obtained on the main page.
Each comic has its own page, so I try to open each item's page and scrape the price.
Here is the spider:
import scrapy
from scrapy.loader import ItemLoader
from comicscraper.items import ComicscraperItem  # assuming the item is defined in the project's items.py

class PaniniSpider(scrapy.Spider):
    name = "spiderP"
    start_urls = ["http://comics.panini.it/store/pub_ita_it/magazines.html"]

    def parse(self, response):
        # Get all the <a> tags
        for sel in response.xpath("//div[@class='list-group']//h3/a"):
            l = ItemLoader(item=ComicscraperItem(), selector=sel)
            l.add_xpath('title', './text()')
            l.add_xpath('link', './@href')
            request = scrapy.Request(sel.xpath('./@href').extract_first(), callback=self.parse_isbn, dont_filter=True)
            request.meta['l'] = l
            yield request

    def parse_isbn(self, response):
        l = response.meta['l']
        l.add_xpath('price', "//p[@class='special-price']//span/text()")
        return l.load_item()
The problem concerns the prices: the output looks like this:
{"title": "Spider-Man 14", "link": ["http://comics.panini.it/store/pub_ita_it/mmmsm014isbn-it-marvel-masterworks-spider-man-marvel-masterworks-spider.html"], "price": ["\n \u20ac\u00a022,50 ", "\n \u20ac\u00a076,50 ", "\n \u20ac\u00a022,50 ", "\n \u20ac\u00a022,50 ", "\n \u20ac\u00a022,50 ", "\n \u20ac\u00a018,00
{"title": "Avenger di John Byrne", "link": ["http://comics.panini.it/store/pub_ita_it/momae005isbn-it-omnibus-avengers-epic-collecti-marvel-omnibus-avengers-by.html"], "price": ["\n \u20ac\u00a022,50 ", "\n \u20ac\u00a076,50 ", "\n \u20ac\u00a022,50
In short, each request collects the list of prices for every item, so the price is not unique to the item but a whole list.
How can I pass only the link of the related item and store one price per item?
Answer 0 (score: 1)
I see two methods.

First, get the price in the subpage using response.xpath:
def parse_isbn(self, response):
    l = response.meta['l']
    price = response.xpath("//p[@class='special-price']//span/text()")
    # ... do something with price ...
    return l.load_item()
Or, second, get the div on the main page that already holds all the necessary information (title, link and price):
for sel in response.xpath('//div[@id="products-list"]/div'):
    l = ItemLoader(item=ComicscraperItem(), selector=sel)  # create a fresh loader for every product
    l.add_xpath('title', './/h3/a/text()')
    l.add_xpath('link', './/h3/a/@href')
    l.add_xpath('price', './/p[@class="special-price"]//span/text()')
    yield l.load_item()
and then you don't need parse_isbn at all.
For testing I used a standalone script which you can put in a single file and run without creating a project.
It gets the prices correctly.
import scrapy


def clean(text):
    text = text.replace('\xa0', ' ')
    text = text.strip().split('\n')
    text = ' '.join(x.strip() for x in text)
    return text


class PaniniSpider(scrapy.Spider):
    name = "spiderP"
    start_urls = ["http://comics.panini.it/store/pub_ita_it/magazines.html"]

    def parse(self, response):
        for sel in response.xpath('//div[@id="products-list"]/div'):
            yield {
                'title': clean(sel.xpath('.//h3/a/text()').get()),
                'link': clean(sel.xpath('.//h3/a/@href').get()),
                'price': clean(sel.xpath('.//p[@class="special-price"]//span/text()').get()),
            }


from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',  # csv, json, xml
    'FEED_URI': 'output.csv',
})
c.crawl(PaniniSpider)
c.start()
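Run it as a plain Python file (e.g. python script.py); the CrawlerProcess settings above write the scraped rows to output.csv.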
EDIT: if you do have to load the other page, then use add_value together with response.xpath(...).get() instead of add_xpath. The loader was created with a selector from the main page, so add_xpath in parse_isbn still queries that old page (which is why every item ends up with the full list of prices); add_value takes the value you extract from the subpage response directly:
def parse_isbn(self, response):
    l = response.meta['l']
    l.add_value('price', response.xpath("//p[@class='special-price']//span/text()").get())
    return l.load_item()
Full example:
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose


def clean(text):
    text = text.replace('\xa0', ' ')
    text = text.strip().split('\n')
    text = ' '.join(x.strip() for x in text)
    return text


class ComicscraperItem(scrapy.Item):
    title = scrapy.Field(input_processor=MapCompose(clean))
    link = scrapy.Field()
    price = scrapy.Field(input_processor=MapCompose(clean))


class PaniniSpider(scrapy.Spider):
    name = "spiderP"
    start_urls = ["http://comics.panini.it/store/pub_ita_it/magazines.html"]

    def parse(self, response):
        # Get all the <a> tags
        for sel in response.xpath("//div[@class='list-group']//h3/a"):
            l = ItemLoader(item=ComicscraperItem(), selector=sel)
            l.add_xpath('title', './text()')
            l.add_xpath('link', './@href')
            request = scrapy.Request(sel.xpath('./@href').extract_first(), callback=self.parse_isbn, dont_filter=True)
            request.meta['l'] = l
            yield request

    def parse_isbn(self, response):
        l = response.meta['l']
        l.add_value('price', response.xpath("//p[@class='special-price']//span/text()").get())
        return l.load_item()


from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',  # csv, json, xml
    'FEED_URI': 'output.csv',
})
c.crawl(PaniniSpider)
c.start()
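As a side note, on Scrapy 1.7+ you can pass the loader through cb_kwargs instead of request.meta; a sketch of the lines that would change inside the spider class:

# in parse(): pass the loader as a keyword argument for the callback
request = scrapy.Request(
    sel.xpath('./@href').extract_first(),
    callback=self.parse_isbn,
    dont_filter=True,
    cb_kwargs={'loader': l},
)
yield request

# parse_isbn() then receives it as a normal parameter
def parse_isbn(self, response, loader):
    loader.add_value('price', response.xpath("//p[@class='special-price']//span/text()").get())
    return loader.load_item()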
Answer 1 (score: -1)
Create an item loader by subclassing Scrapy's ItemLoader and apply default_output_processor = TakeFirst().
For example:
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class DefaultItemLoader(ItemLoader):
    # this per-field processor applies TakeFirst() only to 'link';
    # set default_output_processor = TakeFirst() to apply it to every field
    link_output_processor = TakeFirst()
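Used from the question's spider it would look roughly like this (a sketch, reusing the asker's ComicscraperItem):

l = DefaultItemLoader(item=ComicscraperItem(), selector=sel)
l.add_xpath('link', './@href')  # TakeFirst() keeps only the first extracted value
item = l.load_item()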
You can also refer to my project: https://github.com/yashpokar/amazon-crawler/blob/master/amazon/loaders.py