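I am learning Python. I am trying to write a script that scrapes key data from certain cells of certain tables on a web page, while ignoring the other cells I am not interested in.
The script I have written so far collects the first two rows of the table, but then raises an error: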
Traceback (most recent call last):
  File "/home/Scripts/scraper.py", line 36, in <module>
    mp3 = mp3_container[0]['href']
IndexError: list index out of range
This is the code I have written so far:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://XXX'
# open the connection and grab the page
uClient = uReq(my_url)
url_html = uClient.read()
uClient.close()
# parse the HTML
url_soup = soup(url_html, "html.parser")
# locate the table and collect its rows
url_data = {}
url_table = url_soup.table
url_table_data = url_table.tbody.find_all("tr")
url_t_d = url_table_data[0]
# extract and print the data for each row
for url_t_d in url_table_data:
    artist_container = url_t_d.find_all("td", {"class":"artist"})
    artist = artist_container[0].text
    title_container = url_t_d.find_all("td", {"class":"title"})
    title = title_container[0].text
    year_container = url_t_d.find_all("td", {"class":"year"})
    year = year_container[0].text
    mp3_container = url_t_d.find_all("a", {"title":"MP3 sample"})
    mp3 = mp3_container[0]['href']
    article_container = url_t_d.find_all("td", {"class":"articleListInfo"})
    article_link = article_container[0].a['href']
    print("Artist: " + artist)
    print("Title: " + title)
    print("year: " + year)
    print("mp3: " + mp3)
    print("link: " + article_link)
Can anyone suggest where I might be going wrong? Thanks.
Answer 0 (score: 0)
I solved the problem by wrapping each lookup inside the "for" loop in a try/except, for example:
try:
    mp3_container = url_t_d.find_all("a", {"title":"MP3 sample"})
    mp3 = mp3_container[0]['href']
except (IndexError, KeyError):
    # this row has no MP3 link (or the link has no href)
    mp3 = "none"
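For what it's worth, the IndexError appears because find_all returns an empty list for any row that has no matching element, so indexing it with [0] fails. A possible alternative to repeating try/except for every field is a small helper that checks for an empty result first. This is only a sketch: the name first_attr and the "none" default are my own choices, not part of BeautifulSoup, and the demo uses plain dicts as stand-ins for tags so that it runs without fetching a page. It relies on the fact that both bs4 Tag objects and dicts provide .get(key, default).

```python
def first_attr(matches, attr, default="none"):
    """Return the given attribute of the first match, or default.

    `matches` is expected to be the list returned by find_all.
    The default is returned when the list is empty or when the
    first match lacks the attribute.
    """
    if not matches:
        return default
    # .get exists on both bs4 Tag objects and plain dicts
    return matches[0].get(attr, default)


# Plain dicts stand in for <a> tags here (hypothetical sample data).
row_with_link = [{"href": "sample.mp3", "title": "MP3 sample"}]
row_without_link = []

print("mp3: " + first_attr(row_with_link, "href"))
print("mp3: " + first_attr(row_without_link, "href"))
```

Inside the loop from the question this would be used as mp3 = first_attr(url_t_d.find_all("a", {"title": "MP3 sample"}), "href"). The same empty-list check would be needed for the artist, title and year cells if some rows (a header row, for instance) lack them.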