For an assignment I have to scrape data from the Kaercher webshop. The data I need are the product title, description and price.
In addition, I need to be able to use the same script for several product categories (pressure washers, vacuum cleaners, etc.), so I will probably need a .csv keyword file, or URLs that are adjusted accordingly.
However, I can't seem to extract any data with my current script.
Info: I'm including the whole file structure and the current code. I only modified the actual spider file (karcher_crawler.py); the other files are mostly defaults.
My folder structure:
scrapy_karcher/              # Project root directory
    scrapy.cfg               # Contains the configuration information to deploy the spider
    scrapy_karcher/          # Project's python module
        __init__.py
        items.py             # Describes the definition of each item that we’re scraping
        middlewares.py       # Project middlewares
        pipelines.py         # Project pipelines file
        settings.py          # Project settings file
        spiders/             # All the spider code goes into this directory
            __init__.py
            karcher_crawler.py   # The spider
My "karcher_crawler.py" code:
import scrapy


class KarcherCrawlerSpider(scrapy.Spider):
    name = 'karcher_crawler'

    start_urls = [
        'https://www.kaercher.com/nl/webshop/hogedrukreinigers-resultaten.html'
    ]

    def parse(self, response):
        products = response.xpath("//div[@class='col-sm-3 col-xs-6 fg-products-item']")
        # iterating over search results
        for product in products:
            # Defining the XPaths
            XPATH_PRODUCT_NAME = ".//div[@class='product-info']//h6[contains(@class,'product-label')]//a/text()"
            XPATH_PRODUCT_PRICE = ".//div[@class='product-info']//div[@class='product-price']//span/text()"
            XPATH_PRODUCT_DESCRIPTION = ".//div[@class='product-info']//div[@class='product-description']//a/text()"

            raw_product_name = product.xpath(XPATH_PRODUCT_NAME).extract()
            raw_product_price = product.xpath(XPATH_PRODUCT_PRICE).extract()
            raw_product_description = product.xpath(XPATH_PRODUCT_DESCRIPTION).extract()

            # cleaning the data
            product_name = ''.join(raw_product_name).strip() if raw_product_name else None
            product_price = ''.join(raw_product_price).strip() if raw_product_price else None
            product_description = ''.join(raw_product_description).strip() if raw_product_description else None

            yield {
                'product_name': product_name,
                'product_price': product_price,
                'product_description': product_description,
            }
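A quick way to see why this spider yields nothing is to test the product XPath in Scrapy's interactive shell; if it returns an empty list, the product grid is rendered client-side and is simply not present in the downloaded HTML:

scrapy shell "https://www.kaercher.com/nl/webshop/hogedrukreinigers-resultaten.html"
# inside the shell, re-run the selector the spider uses:
response.xpath("//div[@class='col-sm-3 col-xs-6 fg-products-item']")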
My "items.py" code:
import scrapy


class ScrapyKarcherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
My "pipelines.py" code:
class ScrapyKarcherPipeline(object):
    def process_item(self, item, spider):
        return item
My "scrapy.cfg" code:
[settings]
default = scrapy_karcher.settings
[deploy]
#url = http://localhost:6800/
project = scrapy_karcher
Answer:
I managed to request the data I needed with the following code:
Spider file (.py):
import json

import scrapy

from krc.items import KrcItem


class KRCSpider(scrapy.Spider):
    name = "krc_spider"
    allowed_domains = ["kaercher.com"]
    start_urls = ['https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page=1&size=8&isocode=nl-NL']

    def parse(self, response):
        data = json.loads(response.text)
        for company in data.get('products', []):
            # create a fresh item per product so earlier results aren't overwritten
            item = KrcItem()
            item["productid"] = company["id"]
            item["name"] = company["name"]
            item["description"] = company["description"]
            item["price"] = company["priceFormatted"]
            yield item
Items file (.py):
import scrapy


class KrcItem(scrapy.Item):
    productid = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()
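With the spider and item defined, the results can be exported straight to CSV with Scrapy's built-in feed export, no pipeline changes needed:

scrapy crawl krc_spider -o products.csv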
Thanks to @gangabass, I managed to find the URLs containing the data I needed to extract. (You can find them in the "Network" tab when you inspect the web page: press F12, or right-click where you want to inspect.)
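To cover multiple product categories with one spider (the original goal), one option is to keep one category/URL pair per line in a small CSV file and generate the requests from it. The sketch below is mine, not confirmed by the source: it assumes a hypothetical keywords.csv with a category,url header row, one API URL per category in the same format as the URL above, and the same JSON layout for every category:

import csv
import json

import scrapy

from krc.items import KrcItem


class KrcCsvSpider(scrapy.Spider):
    name = "krc_csv_spider"
    allowed_domains = ["kaercher.com"]

    def start_requests(self):
        # keywords.csv is hypothetical: each row holds "category,url", where url
        # is a category-specific API endpoint like the one used above
        with open("keywords.csv", newline="") as f:
            for row in csv.DictReader(f):
                # the category label travels along in meta in case it's needed later
                yield scrapy.Request(row["url"], meta={"category": row["category"]})

    def parse(self, response):
        data = json.loads(response.text)
        for product in data.get("products", []):
            item = KrcItem()
            item["productid"] = product["id"]
            item["name"] = product["name"]
            item["description"] = product["description"]
            item["price"] = product["priceFormatted"]
            yield item

The page and size query parameters visible in the API URL suggest the endpoint is paginated, so stepping page (or raising size) should return further results per category.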