Web scraping: list index out of range

Time: 2018-12-06 07:19:29

Tags: python web-scraping

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://www.amazon.in/s/ref=sr_nr_p_36_4?fst=as%3Aoff&rh=n%3A976419031%2Cn%3A1389401031%2Cn%3A1389432031%2Ck%3Amobile%2Cp_36%3A1318507031&keywords=mobile&ie=UTF8&qid=1543902909&rnid=1318502031"
uClient = uReq(my_url)
raw_html= uClient.read()
uClient.close()

page_soup = soup(raw_html, "html.parser")
containers = page_soup.findAll("div",{"class":"s-item-container"})

filename = "Product.csv"
f = open (filename , "w")

headers = "Name,Price,Prime \n"
f.write(headers)

for container in containers:

    title_container = container.findAll("div",{"class":"a-row a-spacing-mini"})
    product_name = title_container[0].div.a.h2.text

    price = container.findAll("span",{"class":"a-size-small a-color-secondary a-text-strike"})
    product_price = price[0].text.strip()

    prime = container.findAll("i",{"class":"a-icon a-icon-prime a-icon-small s-align-text-bottom"})
    product_prime = prime[0].text

    print("product_name : " + product_name)
    print("product_price : " + product_price)
    print("product_prime : " + product_prime)

    f.write(product_name + "," + product_price + "," + product_prime + "\n") 
f.close()

I wrote my first web-scraping script, but for some reason it only loops 4 times and then shows an error message: (File "firstwebscrapping.py", line 23, product_price = price[0].text.strip() IndexError: list index out of range). Can someone please explain what I did wrong?

2 Answers:

Answer 0 (score: 0):

Not every container has a <span class="a-size-small a-color-secondary a-text-strike">.

So when you search for these elements:

price = container.findAll("span",{"class":"a-size-small a-color-secondary a-text-strike"})

nothing may be found, and price comes back as an empty list. On the next line you access the first element of price:

product_price = price[0].text.strip()

Since price is empty, you get the error IndexError: list index out of range.
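A minimal illustration of the failure, where the empty list stands in for what findAll returns when no element on the page matches:

```python
price = []  # what findAll returns when nothing matches the given classes

try:
    product_price = price[0]  # same access pattern as price[0].text.strip()
except IndexError as err:
    print(err)  # list index out of range
```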

For example, the page at the link in your code contains a listing (screenshot omitted) for a OnePlus 6T. You are selecting the strike-through price, but the OnePlus 6T does not have one; it only has <span class="a-size-base a-color-price s-price a-text-bold">.

You can check whether price is empty and, if it is, search for the price in the span above instead.
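A sketch of that check, run against hypothetical markup with one strike-through price and one current-only price (the fallback class names are the ones quoted in this thread; the HTML below is invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippets mimicking two kinds of result containers:
# one with a strike-through price, one with only the current price.
html = """
<div class="s-item-container">
  <span class="a-size-small a-color-secondary a-text-strike">Rs. 20,000</span>
</div>
<div class="s-item-container">
  <span class="a-size-base a-color-price s-price a-text-bold">Rs. 37,999</span>
</div>
"""

page = BeautifulSoup(html, "html.parser")
prices = []
for container in page.find_all("div", {"class": "s-item-container"}):
    # First try the strike-through (original) price ...
    price = container.find_all("span", {"class": "a-text-strike"})
    # ... and fall back to the current price if the list came back empty.
    if not price:
        price = container.find_all("span", {"class": "s-price"})
    prices.append(price[0].text.strip() if price else "N/A")

print(prices)  # ['Rs. 20,000', 'Rs. 37,999']
```

The trailing "N/A" guard keeps the loop alive even when a container has no price span at all.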

Answer 1 (score: 0):

The first problem is that not every item has both an original and a current price, so you can change this selector

from "class":"a-size-small a-color-secondary a-text-strike"

to "class":"a-size-base a-color-price s-price a-text-bold"

This code also raises another problem:

containers = target[0].findAll("div",{"class":"s-item-container"})

s-item-container appears not only under atfResults but also under ajaxData, so we use the select function, target = page_soup.select('div#atfResults'), to get the target div list. Hopefully this solves your problem.

div#search-main-wrapper > div#ajaxData > s-item-container
div#search-main-wrapper > div#atfResults > s-item-container
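The effect of scoping to div#atfResults can be shown on a minimal stand-in for that structure (the markup below is a simplified sketch, not the real Amazon page):

```python
from bs4 import BeautifulSoup

# Simplified sketch of the page structure described above.
html = """
<div id="search-main-wrapper">
  <div id="ajaxData"><div class="s-item-container">sponsored item</div></div>
  <div id="atfResults"><div class="s-item-container">real result</div></div>
</div>
"""
page_soup = BeautifulSoup(html, "html.parser")

# Searching the whole page picks up containers from both sections ...
all_containers = page_soup.find_all("div", {"class": "s-item-container"})
print(len(all_containers))  # 2

# ... while scoping to div#atfResults keeps only the main results.
target = page_soup.select("div#atfResults")
containers = target[0].find_all("div", {"class": "s-item-container"})
print(len(containers))  # 1
```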

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://www.amazon.in/s/ref=sr_nr_p_36_4?fst=as%3Aoff&rh=n%3A976419031%2Cn%3A1389401031%2Cn%3A1389432031%2Ck%3Amobile%2Cp_36%3A1318507031&keywords=mobile&ie=UTF8&qid=1543902909&rnid=1318502031"
uClient = uReq(my_url)
raw_html= uClient.read()
uClient.close()

page_soup = soup(raw_html, "html.parser")

target = page_soup.select('div#atfResults')
containers = target[0].findAll("div",{"class":"s-item-container"})

filename = "Product.csv"
f = open (filename , "w")

headers = "Name,Price,Prime \n"
f.write(headers)
print(len(containers))
for container in containers:

    title_container = container.findAll("div",{"class":"a-row a-spacing-mini"})
    product_name = title_container[0].div.a.h2.text

    price = container.findAll("span",{"class":"a-size-base a-color-price s-price a-text-bold"})
    product_price = price[0].text.strip()

    prime = container.findAll("i",{"class":"a-icon a-icon-prime a-icon-small s-align-text-bottom"})
    product_prime = prime[0].text

    print("product_name : " + product_name)
    print("product_price : " + product_price)
    print("product_prime : " + product_prime)

    f.write(product_name + "," + product_price + "," + product_prime + "\n") 
f.close()