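I have just started learning Python and Scrapy. My first project is to crawl information from a website containing web security information. However, when I run the command from cmd, it says "Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)" and nothing seems to come out. I would be grateful if anyone could help me solve this.
My code: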
import scrapy

class SapoSpider(scrapy.Spider):
    name = "imo"
    allowed_domains = ["imovirtual.com"]
    start_urls = ["https://www.imovirtual.com/arrendar/apartamento/lisboa/"]

    def parse(self,response):
        subpage_links = []
        for i in response.css('div.offer-item-details'):
            youritem = {
                'preco':i.css('span.offer-item title::text').extract_first(),
                'autor':i.css('li.offer-item-price::text').extract(),
                'data':i.css('li.offer-item-area::text').extract(),
                'data_2':i.css('li.offer-item-price-perm::text').extract()
            }
            subpage_link = i.css('header[class=offer-item-header] a::attr(href)').extract()
            subpage_links.extend(subpage_link)
            for subpage_link in subpage_links:
                yield scrapy.Request(subpage_link, callback=self.parse_subpage, meta={'item':youritem})

    def parse_subpage(self,response):
        for j in response.css('header[class=offer-item-header] a::attr(href)'):
            youritem = response.meta.get('item')
            youritem['info'] = j.css(' ul.dotted-list, li.h4::text').extract()
            yield youritem
Answer 0 (score: 0)
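To make it work, two things need to be corrected:
You need to define the FEED_URI setting with the path where you want the results to be stored.
You need to use response directly in parse_subpage, because the logic is as follows: Scrapy downloads "https://www.imovirtual.com/arrendar/apartamento/lisboa/" and gives the response to parse; there you extract the ad URLs and ask Scrapy to download each page and give the downloaded pages to parse_subpage. So the response in parse_subpage corresponds to a page such as https://www.imovirtual.com/anuncio/t0-totalmente-remodelado-localizacao-excelente-IDGBAY.html#913474cdaa, not to the listing page. Looping over the listing-page selector header[class=offer-item-header] a::attr(href) there presumably matches nothing, so no item is ever yielded.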
This should work:
import scrapy

class SapoSpider(scrapy.Spider):
    name = "imo"
    allowed_domains = ["imovirtual.com"]
    start_urls = ["https://www.imovirtual.com/arrendar/apartamento/lisboa/"]
    # Write the scraped items to this file.
    custom_settings = {
        'FEED_URI': './output.json'
    }

    def parse(self,response):
        # Called with the listing page: collect data and the link of each ad.
        subpage_links = []
        for i in response.css('div.offer-item-details'):
            youritem = {
                'preco':i.css('span.offer-item title::text').extract_first(),
                'autor':i.css('li.offer-item-price::text').extract(),
                'data':i.css('li.offer-item-area::text').extract(),
                'data_2':i.css('li.offer-item-price-perm::text').extract()
            }
            subpage_link = i.css('header[class=offer-item-header] a::attr(href)').extract()
            subpage_links.extend(subpage_link)
            for subpage_link in subpage_links:
                # Pass the partially filled item along to the ad page callback.
                yield scrapy.Request(subpage_link, callback=self.parse_subpage, meta={'item':youritem})

    def parse_subpage(self,response):
        # Called with a single ad page: query the response directly.
        youritem = response.meta.get('item')
        youritem['info'] = response.css(' ul.dotted-list, li.h4::text').extract()
        yield youritem
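A note on the FEED_URI setting used above: as far as I know, Scrapy 2.1 and later deprecate FEED_URI in favor of the FEEDS dictionary. A minimal sketch of what the equivalent custom_settings might look like, assuming such a Scrapy version and keeping the same output.json path:

```python
# Assumption: Scrapy 2.1 or newer, where FEEDS replaces FEED_URI / FEED_FORMAT.
# Each key is an output path; the value holds the options for that feed.
custom_settings = {
    "FEEDS": {
        "output.json": {"format": "json"},
    },
}
```

This dictionary would go in the spider class in place of the FEED_URI version. On older Scrapy releases the FEED_URI form shown in the answer is the one to keep.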