I am using the Scrapy library to scrape data from a website.
I get results from the scrape and I want to save them to a database; I am using Scrapy items and pipelines. I get a list, so I need a for loop to save the items. The problem is that only the last item in the list gets saved.
My code is as follows:
def parse(self, response):
    vehicles = []
    total_results = response.css('.cl-filters-summary-counter::text').extract_first().replace('.', '')

    reference_urls = []
    for url in response.css('.cldt-summary-titles'):
        reference_url = url.css("a::attr(href)").extract_first().strip(' \t\n\r')
        reference_urls.append(reference_url)

    ids = []
    for item in response.css('.cldt-summary-full-item'):
        car_id = item.css("::attr(id)").extract_first().strip(' \t\n\rli-')
        ids.append(car_id)

    prices = []
    for item in response.css('.cldt-price'):
        dirty_price = item.css("::text").extract_first().strip(' \t\n\r')
        comma = dirty_price.index(",-")
        price = dirty_price[2:comma].replace('.', '')
        prices.append(price)

    for item in zip(ids, reference_urls, prices):
        car = CarItem()
        car['reference'] = item[0]
        car['reference_url'] = item[1]
        car['data'] = ""
        car['price'] = item[2]
        return car
The results I get from the scrape are fine. If I do something like the following inside the for loop:
vehicles = []
for item in zip(ids, reference_urls, prices):
    scraped_info = {
        "reference": item[0],
        "reference_url": item[1],
        "price": item[2]
    }
    vehicles.append(scraped_info)
and then print vehicles, I get the correct result:
[
    {
        "price": "4250",
        "reference": "6784086e-1afb-216d-e053-e250040a033f",
        "reference_url": "some-link-1"
    },
    {
        "price": "4250",
        "reference": "c05595ac-e49e-4b71-a436-868c192ef82c",
        "reference_url": "some-link-2"
    },
    {
        "price": "4900",
        "reference": "444553f2-e8fd-41c9-9244-182668544e2a",
        "reference_url": "some-link-3"
    }
]
Update
CarItem is just defined in items.py:
class CarItem(scrapy.Item):
    # define the fields for your item here like:
    reference = scrapy.Field()
    reference_url = scrapy.Field()
    data = scrapy.Field()
    price = scrapy.Field()
Any idea what I am doing wrong?
Answer 0 (score: 0)
According to the Scrapy documentation, the parse method, like any other request callback, must return an iterable of Request and/or dicts or Item objects.
And, per the code example linked there:
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
we can see that we must use yield in the parse function to get the correct results.
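The reason is plain Python, not Scrapy: return exits the function on the first pass through the loop, so at most one item ever comes back, while yield produces a value on every iteration. A minimal, Scrapy-independent sketch (not from the original post) showing the difference:

def with_return(items):
    for x in items:
        return x  # exits the function on the first iteration

def with_yield(items):
    for x in items:
        yield x   # produces a value on every iteration

print(with_return([1, 2, 3]))       # 1
print(list(with_yield([1, 2, 3])))  # [1, 2, 3]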
tl;dr: replace return in the last line with yield.
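Applied to the code in the question, only the last loop changes:

for item in zip(ids, reference_urls, prices):
    car = CarItem()
    car['reference'] = item[0]
    car['reference_url'] = item[1]
    car['data'] = ""
    car['price'] = item[2]
    yield car  # was: return car

Since the goal is to save the items to a database through a pipeline, here is a minimal pipeline sketch; the SQLite backend, the cars.db file, and the cars table are illustrative assumptions, not from the original post. Scrapy calls process_item once per yielded item, so every car gets saved, not just one:

import sqlite3

class CarPipeline:
    def open_spider(self, spider):
        # Hypothetical database file and table; adjust to your schema.
        self.conn = sqlite3.connect('cars.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS cars '
            '(reference TEXT, reference_url TEXT, data TEXT, price TEXT)'
        )

    def process_item(self, item, spider):
        # Called once for every item the spider yields.
        self.conn.execute(
            'INSERT INTO cars VALUES (?, ?, ?, ?)',
            (item['reference'], item['reference_url'],
             item['data'], item['price'])
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

Remember to enable the pipeline in settings.py via the ITEM_PIPELINES setting, otherwise process_item will never run.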