Question

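I am learning Python. I am trying to write a script that scrapes key data from certain cells of certain tables on a web page, while ignoring the other cells I am not interested in.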

The script I have written so far collects the first two rows of the table, but then it raises an error:

Traceback (most recent call last):
  File "/home/Scripts/scraper.py", line 36, in <module>
    mp3 = mp3_container[0]['href']
IndexError: list index out of range

This is the code I have written so far:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://XXX'

# open the connection and grab the page
uClient = uReq(my_url)
url_html = uClient.read()
uClient.close()

# html parser
url_soup = soup(url_html, "html.parser")

# grab the table rows
url_data = {}
url_table = url_soup.table
url_table_data = url_table.tbody.find_all("tr")

 
url_t_d = url_table_data[0]

# template for extracting and printing data

for url_t_d in url_table_data:
    artist_container = url_t_d.find_all("td", {"class":"artist"})
    artist = artist_container[0].text
    
    title_container = url_t_d.find_all("td", {"class":"title"})
    title = title_container[0].text
    
    year_container = url_t_d.find_all("td", {"class":"year"})
    year = year_container[0].text
    
    mp3_container = url_t_d.find_all("a", {"title":"MP3 sample"})
    mp3 = mp3_container[0]['href']
    
    article_container = url_t_d.find_all("td", {"class":"articleListInfo"})
    article_link = article_container[0].a['href']
    
    print("Artist: " + artist)
    print("Title: " + title)
    print("year: " + year)
    print("mp3: "+ mp3)
    print("link: " + article_link)

Can anyone suggest where I might be going wrong? Thanks.

Answer 1

I solved the problem by wrapping each lookup inside the "for" loop in a try/except, for example:

try:
    mp3_container = url_t_d.find_all("a", {"title":"MP3 sample"})
    mp3 = mp3_container[0]['href']
except IndexError:
    # this row has no MP3 sample link, so fall back to a placeholder
    mp3 = "none"
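
The traceback suggests that find_all() returned an empty list for a row without an MP3 sample link, so indexing it with [0] fails. As an alternative to repeating try/except for every field, here is a minimal sketch of a small helper that checks for an empty result first. The name first_or_default and its parameters are hypothetical (they are not part of BeautifulSoup), and the plain lists at the bottom only stand in for find_all() results.

```python
def first_or_default(items, extract, default="none"):
    """Return extract(items[0]), or default when items is empty.

    items   -- a list such as the result of find_all() (assumed)
    extract -- a function applied to the first element
    default -- the placeholder used when nothing was found
    """
    if not items:
        # nothing matched in this row, so avoid indexing an empty list
        return default
    return extract(items[0])


# Stand-ins for find_all() results: one row with a link, one without.
row_with_link = [{"href": "sample.mp3"}]
row_without_link = []

mp3_found = first_or_default(row_with_link, lambda a: a["href"])
mp3_missing = first_or_default(row_without_link, lambda a: a["href"])
```

Inside the loop from the question, this could be used as mp3 = first_or_default(url_t_d.find_all("a", {"title": "MP3 sample"}), lambda a: a["href"]), and likewise for the other fields with lambda td: td.text.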
