Question

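I'm trying to scrape a number of game stores with Scrapy, and they all give me the same problem. I'm using XPath, and the HTML for a game's price differs depending on whether the price is simply shown as £20.09, or shown as £20.09 struck through followed by £14.49 to indicate a discount. I'd be happy to end up with two columns, a "was" column (20.09, allowing nulls) followed by a "now" column (14.49), but I can't figure out how to get a null in the gaps rather than having all the values below shift up to fill them.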

Here is my code for the site cdkeys - https://www.cdkeys.com/pc/games?limit=50, which has both discounted and non-discounted games.

allowed_domains = ['cdkeys.com']  # Scrapy reads allowed_domains (bare domains); allowed_urls is ignored
start_urls = ['https://www.cdkeys.com/pc/games/{pageno}?limit=50'.format(pageno=pageno)
    for pageno in range(1, 10)]

def parse(self, response):
    Games = response.xpath('//*[@id="root-wrapper"]/div/div[1]/div[2]/div[3]/div[2]/div[2]/ul/li/h2/a/text()').extract()
    Prices = response.xpath('//span[starts-with(@id, "product-price-")]/span[1]/span/text()').extract()
    for i, (Game, Price) in enumerate(zip(Games, Prices)):
        yield {'index': i, 'Game': Game, 'Price':Price}

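The problem is the XPath for the prices: I can get a list of only the discounted prices, or a list of prices only for the games without a discount, because the HTML for the two cases is very different.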

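What stops me from simply creating two lists is that, because I'm using zip and enumerate, the loop just iterates over the first x games until it runs out of prices, rather than tying each game to its corresponding price.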
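
To illustrate the misalignment described above, here is a minimal sketch using made-up lists rather than real scraped data. The third game name and the assumption that only "Insurgency PC" is discounted are hypothetical, chosen only to show how zip pairs items by position and stops at the shorter list.

```python
# Hypothetical data: three games, but only one of them is discounted.
games = ["Stellaris: Utopia PC DLC", "Insurgency PC", "Some Other Game PC"]
discount_prices = ["$ 1.99"]  # assumed to belong to "Insurgency PC" only

# zip() pairs items purely by position and stops at the shorter list,
# so the single discount price gets attached to the first game.
pairs = list(zip(games, discount_prices))
print(pairs)  # [('Stellaris: Utopia PC DLC', '$ 1.99')]
```

The discount ends up on the wrong game and the other two games are dropped entirely, which is why two independently extracted lists cannot be lined up after the fact.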

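Any help on either getting only the correct price into Prices, or finding a way to have nulls rather than the following values moving up, would be much appreciated. I'm new to both Python and web crawling and am just trying to get my head around it all.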

Answer 1

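I would approach it differently - iterate over the products one by one and, for each product, locate the game name, the regular price and the discounted price: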

def parse(self, response):
    for game in response.css("ul.products-grid li.item"):
        name = game.css("h2.product-name > a::text").extract_first()
        old_price = game.css(".regular-price .price::text,.old-price .price::text").extract_first()
        discount_price = game.css(".special-price .price::text").extract_first()

        yield {
            "name": name,
            "old_price": old_price,
            "discount_price": discount_price
        }

For the first page, you would get the following output:

{'old_price': u'$ 13.59', 'name': u'Stellaris: Utopia PC DLC', 'discount_price': None}
{'old_price': u' $ 9.49 ', 'name': u'Insurgency PC', 'discount_price': u' $ 1.99 '}
...
{'old_price': u' $ 81.59 ', 'name': u'Call of Duty Black Ops II 2 Digital Deluxe Edition PC ', 'discount_price': u' $ 13.59 '}

Note how the old price is filled in both when there is a discount and when there isn't.
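
Some of the extracted strings in the output above carry surrounding whitespace and a currency symbol. If numeric "was" and "now" columns are wanted, one possible post-processing step is sketched below. The helper name clean_price and the keys "was" and "now" are assumptions for illustration, not part of Scrapy, and the sketch assumes prices use "." as the decimal separator and "," only as a thousands separator.

```python
import re


def clean_price(raw):
    """Return the numeric part of a scraped price string as a float, or None.

    Hypothetical helper: assumes "." is the decimal separator.
    """
    if raw is None:
        return None  # no discount: keep the gap as a null
    match = re.search(r"\d+(?:\.\d+)?", raw.replace(",", ""))
    return float(match.group()) if match else None


# One item shaped like the spider's output shown above.
item = {"name": "Insurgency PC", "old_price": " $ 9.49 ", "discount_price": " $ 1.99 "}

cleaned = {
    "name": item["name"].strip(),
    "was": clean_price(item["old_price"]),
    "now": clean_price(item["discount_price"]),
}
print(cleaned)  # {'name': 'Insurgency PC', 'was': 9.49, 'now': 1.99}
```

For a game without a discount, clean_price(None) returns None, so the "now" column stays empty instead of being filled by the next game's price.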
