Trying to scrape a website with Scrapy - not receiving any data

Asked: 2019-08-11 12:28:03

Tags: scrapy

For an assignment, I have to fetch data from the Kaercher web shop. The data I need is each product's title, description, and price.

In addition, I need to be able to use the same script to fetch multiple product categories (pressure washers, vacuum cleaners, etc.). So I will probably need a .csv keyword file, or URLs adjusted accordingly.
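The keyword-file idea could be sketched as follows: a helper reads one keyword per line from a hypothetical `keywords.csv` and builds one result-page URL per keyword. The URL pattern is inferred from the single start URL in the spider below and may not hold for every category:

```python
import csv

def build_start_urls(csv_path):
    """Build one shop-results URL per keyword.

    Assumes keywords.csv holds one Dutch search keyword per line, e.g.
    "hogedrukreinigers" (pressure washers) or "stofzuigers" (vacuums).
    The URL pattern is an assumption based on one observed category URL.
    """
    base = "https://www.kaercher.com/nl/webshop/{}-resultaten.html"
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [base.format(row[0].strip()) for row in csv.reader(f) if row]
```

The resulting list can then be assigned to the spider's `start_urls`.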

However, I can't seem to get any data with my current script.

Info: I am including my full file structure and current code. I have only modified the actual spider file (karcher_crawler.py); the other files are mostly defaults.

My folder structure:

scrapy_karcher/ # Project root directory
    scrapy.cfg  # Contains the configuration information to deploy the spider
    scrapy_karcher/ # Project's python module
        __init__.py
        items.py      # Describes the definition of each item that we’re scraping
        middlewares.py  # Project middlewares
        pipelines.py     # Project pipelines file
        settings.py      # Project settings file
        spiders/         # All the spider code goes into this directory
            __init__.py
            karcher_crawler.py # The spider

My "karcher_crawler.py" code:

import scrapy

class KarcherCrawlerSpider(scrapy.Spider):
    name = 'karcher_crawler'
    start_urls = [
        'https://www.kaercher.com/nl/webshop/hogedrukreinigers-resultaten.html'
    ]

    def parse(self, response):
        products = response.xpath("//div[@class='col-sm-3 col-xs-6 fg-products-item']")
        # iterate over the search results
        for product in products:
            # XPaths relative to each product block
            XPATH_PRODUCT_NAME = ".//div[@class='product-info']//h6[contains(@class,'product-label')]//a/text()"
            XPATH_PRODUCT_PRICE = ".//div[@class='product-info']//div[@class='product-price']//span/text()"
            XPATH_PRODUCT_DESCRIPTION = ".//div[@class='product-info']//div[@class='product-description']//a/text()"

            raw_product_name = product.xpath(XPATH_PRODUCT_NAME).extract()
            raw_product_price = product.xpath(XPATH_PRODUCT_PRICE).extract()
            raw_product_description = product.xpath(XPATH_PRODUCT_DESCRIPTION).extract()

            # clean the data
            product_name = ''.join(raw_product_name).strip() if raw_product_name else None
            product_price = ''.join(raw_product_price).strip() if raw_product_price else None
            product_description = ''.join(raw_product_description).strip() if raw_product_description else None

            yield {
                'product_name': product_name,
                'product_price': product_price,
                'product_description': product_description,
            }

My "items.py" code:

import scrapy


class ScrapyKarcherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

My "pipelines.py" code:

class ScrapyKarcherPipeline(object):
    def process_item(self, item, spider):
        return item
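The default pipeline above just returns each item unchanged, so it has no effect until `process_item` actually transforms or filters something. As a purely hypothetical illustration (not part of the original project), a pipeline that normalizes the scraped price string might look like:

```python
class PriceCleaningPipeline:
    """Hypothetical pipeline: strip the euro sign and whitespace
    from the scraped price string."""

    def process_item(self, item, spider):
        price = item.get("product_price")
        if price:
            item["product_price"] = price.replace("\u20ac", "").strip()
        return item
```

To activate a pipeline like this, it would also have to be listed under `ITEM_PIPELINES` in settings.py.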

My "scrapy.cfg" code:

[settings]
default = scrapy_karcher.settings

[deploy]
#url = http://localhost:6800/
project = scrapy_karcher

1 answer:

Answer 0 (score: 0)

I managed to request the data I needed with the following code:

Spider file (.py):

import scrapy
from krc.items import KrcItem
import json

class KRCSpider(scrapy.Spider):
    name = "krc_spider"
    allowed_domains = ["kaercher.com"]
    start_urls = ['https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page=1&size=8&isocode=nl-NL']

    def parse(self, response):
        data = json.loads(response.text)
        for company in data.get('products', []):
            item = KrcItem()  # create a fresh item per product instead of reusing one instance
            item["productid"] = company["id"]
            item["name"] = company["name"]
            item["description"] = company["description"]
            item["price"] = company["priceFormatted"]
            yield item

Items file (.py):

import scrapy


class KrcItem(scrapy.Item):
    productid = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()

Thanks to @gangabass, I managed to find the URLs that contain the data I needed to extract. (You can find them in the "Network" tab when you inspect the web page: press F12, or right-click the page and choose Inspect.)