Trouble scraping some prices from a website

Time: 2019-06-07 14:09:04

Tags: python web-scraping scrapy

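I'm trying to scrape prices from a website, but some prices are crossed out and a new price is shown next to them, so for those rows I get None. I thought I could add an if statement to pick up the right price, and that partly works. However, instead of the new price I get the crossed-out price, because both use the same identifier. Any ideas on how to fix this? (Attached: screenshot of the HTML code; HTML code of a price that is not crossed out.)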

for game in response.css("tr[class^=deckdbbody]"):
    # Rows listing another condition of the same card omit the name,
    # so fall back to the name saved from the previous row
    saved_name = game.css("a.card_popup::text").get() or saved_name
    # Strip the leading '\n' from the extracted name
    saved_name = saved_name.strip()
    item["Card_Name"] = saved_name
    # Extract the condition, stock, and price from their respective cells
    item["Condition"] = game.css("td[class^=deckdbbody].search_results_7 a::text").get()
    item["Stock"] = game.css("td[class^=deckdbbody].search_results_8::text").get()
    item["Price"] = game.css("td[class^=deckdbbody].search_results_9::text").get()
    # A discounted price sits inside <span> tags, so the cell's own text is None
    if item["Price"] is None:
        # Problem: this returns the first <span>, which is the crossed-out price
        item["Price"] = game.css("td[class^=deckdbbody].search_results_9 span::text").get()

    # Return values
    yield item

2 Answers:

Answer 0 (score: 1)

When scraping, you need to take into account the style attribute style="text-decoration:line-through", which is applied to the prices you don't want.

To do that, you can use BeautifulSoup and rely on the fact that prices that are not crossed out have no style attribute:

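from bs4 import BeautifulSoup as bs
import requests as r

response = r.get(url)  # url: the page you are scraping
soup = bs(response.content, "html.parser")
# 'style': None matches only <td> tags that have no style attribute
decks = soup.find_all('td', {'class': 'deckdbbody', 'style': None})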

Now get the text content of each cell, which is the price:

prices = [d.getText().strip() for d in decks]

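Following your update, I can see that the prices list will contain entries you don't want, because many td elements use this class without holding a price at all. A simple fix is to check whether the text returned by .getText() contains a dollar sign: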

final = []
for price in prices:
    if '$' in price:
        final.append(price)

Now final contains only what you actually want.
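
If numeric values are needed later, the same dollar-sign check can be combined with a conversion step. The following is only a minimal sketch: the helper name to_amounts and the sample strings are made up for illustration, and it assumes prices look like "$1.49" or "$1,024.00".

```python
from decimal import Decimal, InvalidOperation


def to_amounts(prices):
    """Keep only strings containing '$' and convert them to Decimal."""
    amounts = []
    for text in prices:
        if "$" not in text:
            continue  # not a price cell (condition, stock, and so on)
        # Drop the currency symbol and thousands separators before parsing
        cleaned = text.replace("$", "").replace(",", "").strip()
        try:
            amounts.append(Decimal(cleaned))
        except InvalidOperation:
            continue  # contained '$' but was not a plain number
    return amounts


# Hypothetical cell texts, similar to what the prices list might hold
sample = ["$1.49", "NM-Mint", "x 4", "$1,024.00"]
print(to_amounts(sample))
```

Decimal is used instead of float here so that amounts such as 1.49 are stored exactly; if only display strings are needed, the plain filter above is enough.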

Answer 1 (score: 0)

This is what finally worked:

if item["Price"] is None:
    # Select only the red <span>, which holds the new (not crossed-out) price
    item["Price"] = game.css("td[class^=deckdbbody].search_results_9 span[style*='color:red']::text").get()
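
To see why filtering on the style attribute separates the two prices, here is a dependency-free sketch that uses the standard library's html.parser rather than Scrapy selectors. The markup in SAMPLE_CELL is an assumption pieced together from the two answers (old price struck through, new price in red), so the real page may differ. The class name CurrentPriceParser and the function current_prices are hypothetical as well.

```python
from html.parser import HTMLParser

# Hypothetical markup modeled on the answers above; not copied from the site
SAMPLE_CELL = (
    '<td class="deckdbbody search_results_9">'
    '<span style="text-decoration:line-through;">$1.99</span> '
    '<span style="color:red;">$1.49</span>'
    '</td>'
)


class CurrentPriceParser(HTMLParser):
    """Collect text from <span> tags that are not styled line-through."""

    def __init__(self):
        super().__init__()
        self._keep = False  # True while inside a span we want to keep
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # Normalize spacing so "text-decoration: line-through" also matches
            style = (dict(attrs).get("style") or "").replace(" ", "")
            self._keep = "line-through" not in style

    def handle_endtag(self, tag):
        if tag == "span":
            self._keep = False

    def handle_data(self, data):
        text = data.strip()
        if self._keep and text:
            self.prices.append(text)


def current_prices(html):
    """Return the text of every span that is not crossed out."""
    parser = CurrentPriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


print(current_prices(SAMPLE_CELL))
```

This sketch only looks at span elements and does not handle nested spans. Inside a Scrapy spider, the attribute selector shown in this answer does the same job in a single line.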