For an assignment I am trying to build a spider that fetches data from the "www.kaercher.com" webshop. All products in the webshop are loaded through AJAX calls: to load more products, a button labelled "Show more products" has to be pressed. I managed to fetch the data I need from the URL that the AJAX call requests.
However, for my assignment I want to fetch all products (all products/pages) of a given type. I have been digging around but could not find a solution. I suspect it involves the "isTruncated" field: true means more products can be loaded, false means there are no more products. (Fixed)
Once I managed to get the data from all pages, I needed a way to fetch the data for a whole list of products (creating one .csv file with multiple Kärcher products, each of which has a unique ID that is visible in the URL; in this case ID 20035386 is a high-pressure washer). (Fixed)
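For reference, this is roughly what the dict returned by json.loads(response.text) looks like for this endpoint, reconstructed from the fields the spiders below read; the concrete values are illustrative and the real payload contains more fields:

data = {
    "isTruncated": True,   # True: more pages can be requested
    "products": [
        {
            "id": "20035386",
            "name": "...",
            "description": "...",
            "priceFormatted": "€ 479,00"   # formatted string containing the euro sign
        },
        # ... up to `size` products per page
    ],
}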
Links:
Webshop: https://www.kaercher.com/nl/webshop/hogedrukreinigers-resultaten.html
High-pressure washers: https://www.kaercher.com/nl/webshop/hogedrukreinigers-resultaten.html
Old code
Spider file
import scrapy
from krc.items import KrcItem
import json

class KRCSpider(scrapy.Spider):
    name = "krc_spider"
    allowed_domains = ["kaercher.com"]
    start_urls = ['https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page=1&size=8&isocode=nl-NL']

    def parse(self, response):
        item = KrcItem()
        data = json.loads(response.text)
        for company in data.get('products', []):
            item["productid"] = company["id"]
            item["name"] = company["name"]
            item["description"] = company["description"]
            item["price"] = company["priceFormatted"]
            yield item
Items file
import scrapy

class KrcItem(scrapy.Item):
    productid = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()
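To actually get the .csv file, no custom pipeline is needed: Scrapy's built-in feed export writes the yielded items directly. Assuming the project layout above (the output path products.csv is just an example):

scrapy crawl krc_spider -o products.csv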
New code
Edit: 15/08/2019
Thanks to @gangabass, I managed to fetch the data from all product pages. I also managed to fetch the data for the different products listed in a keywords.csv file, which lets me scrape a whole list of products. See the new code below:
Spider file (.py)
import scrapy
from krc.items import KrcItem
import json
import os
import csv

class KRCSpider(scrapy.Spider):
    name = "krc_spider"
    allowed_domains = ["kaercher.com"]

    def start_requests(self):
        """Read keywords from the keywords file and construct the search URLs."""
        with open(os.path.join(os.path.dirname(__file__), "../resources/keywords.csv")) as search_keywords:
            for keyword in csv.DictReader(search_keywords):
                search_text = keyword["keyword"]
                url = "https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/{0}?page=1&size=8&isocode=nl-NL".format(search_text)
                # meta carries the search text (the product ID) into the parser
                yield scrapy.Request(url, callback=self.parse, meta={"search_text": search_text})

    def parse(self, response):
        current_page = response.meta.get("page", 1)
        next_page = current_page + 1
        search_text = response.meta["search_text"]
        data = json.loads(response.text)
        for company in data.get('products', []):
            item = KrcItem()  # fresh item per product
            item["productid"] = company["id"]
            item["name"] = company["name"]
            item["description"] = company["description"]
            item["price"] = company["priceFormatted"].replace("\u20ac", "").strip()
            yield item
        # isTruncated stays true as long as more pages are available for this product ID,
        # so keep requesting pages (per keyword, not the hardcoded 20035386) until it is false
        if data["isTruncated"]:
            yield scrapy.Request(
                url="https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/{0}?page={1}&size=8&isocode=nl-NL".format(search_text, next_page),
                callback=self.parse,
                meta={"page": next_page, "search_text": search_text},
            )
Items file (.py)
import scrapy

class KrcItem(scrapy.Item):
    productid = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()
    producttype = scrapy.Field()  # not yet filled by the spider; see the sketch after the keywords file
Keywords file (.csv)
keyword,keywordtype
20035386,Hogedrukreiniger
20072956,Floor Cleaner
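The producttype field defined in the items file is never populated by the spider above. If the keywordtype column should end up in the output as well, one option (a sketch, not part of the original code) is to pass it through meta the same way as search_text, and to forward it in the pagination request so follow-up pages keep it:

# in start_requests: also carry the keywordtype column
yield scrapy.Request(url, callback=self.parse,
                     meta={"search_text": search_text,
                           "producttype": keyword["keywordtype"]})

# in parse, inside the product loop:
item["producttype"] = response.meta["producttype"]

# and in the pagination request:
meta={"page": next_page, "search_text": search_text,
      "producttype": response.meta["producttype"]}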
Answer 0 (score: 0)
You can use response.meta to send the current page number between requests:
def parse(self, response):
    current_page = response.meta.get("page", 1)
    next_page = current_page + 1

    item = KrcItem()
    data = json.loads(response.text)
    for company in data.get('products', []):
        item["productid"] = company["id"]
        item["name"] = company["name"]
        item["description"] = company["description"]
        item["price"] = company["priceFormatted"]
        yield item

    if data["isTruncated"]:
        yield scrapy.Request(
            url="https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page={page}&size=8&isocode=nl-NL".format(page=next_page),
            callback=self.parse,
            meta={'page': next_page},
        )
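On the first response, response.meta.get("page", 1) falls back to 1 because the start request carries no page key; each follow-up request then carries the incremented page number. The chain stops as soon as the API returns isTruncated: false, so the spider finishes after the last page has been parsed.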