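I have a Python script that scrapes some data from TripAdvisor (the ratings from reviews). The problem is that every time I run it, it gives me a different number of rows, and it never scrapes all of the pages.
The index error that comes up is: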
Traceback (most recent call last):
File "C:/Users/thimios/PycharmProjects/TripadvisorScrapping/proxiro.py", line 26, in <module>
rating = soup.findAll("div", {'class': 'rating reviewItemInline'})[i]
IndexError: list index out of range
The code is as follows:
from bs4 import BeautifulSoup
import os
import urllib.request
file2 = open(os.path.expanduser(r"~/Desktop/TripAdviser Reviews2.csv"), "wb")
file2.write(b"Organization,Rating" + b"\n")
WebSites = [
    "https://www.tripadvisor.com/Hotel_Review-g189400-d198932-Reviews-Hilton_Athens-Athens_Attica.html#REVIEWS"]
Checker = "REVIEWS"
# looping through each site until it hits a break
for theurl in WebSites:
    thepage = urllib.request.urlopen(theurl)
    soup = BeautifulSoup(thepage, "html.parser")
    #print(soup)
    while True:
        # Extract ratings from the text reviews
        altarray = ""
        for i in range(0,10):
            rating = soup.findAll("div", {'class': 'rating reviewItemInline'})[i]
            rating1 = rating.find_all("span")[0]
            rating2 = rating1['class'][1][-2:]
            print(rating2)
            if len(altarray) == 0:
                altarray = [rating2]
            else:
                altarray.append(rating2)
        #print(altarray)
        #print(len(altarray))
        #print(type(altarray))
        # Extract Organization,
        Organization1 = soup.find(attrs={'class': 'heading_name'})
        Organization = Organization1.text.replace('"', ' ').replace('Review of',' ').strip()
        #print(Organization)
        # Loop through each review on the page
        for x in range(0, 10):
            Rating = altarray[x]
            Rating = str(Rating)
            #print(Rating)
            #print(type(Rating))
            Record2 = Organization + "," + Rating
            if Checker == "REVIEWS":
                file2.write(bytes(Record2, encoding="ascii", errors='ignore') + b"\n")
        link = soup.find_all(attrs={"class": "nav next rndBtn ui_button primary taLnk"})
        #print(link)
        #print(link[0])
        if len(link) == 0:
            break
        else:
            soup = BeautifulSoup(urllib.request.urlopen("http://www.tripadvisor.com" + link[0].get('href')),"html.parser")
            #print(soup)
            #print(Organization)
            print(link[0].get('href'))
            Checker = link[0].get('href')[-7:]
            #print(Checker)
file2.close()
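I suspect TripAdvisor is not giving full access to the data. Any ideas?
Answer 0 (score: 0)
You get this error when you try to access a list element by index and that index does not exist.
I ran your code and it printed: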
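To make the failure mode concrete, here is a minimal sketch that needs no network access. The `matches` list is a hypothetical stand-in for whatever `soup.findAll(...)` returns on a page that happens to contain fewer than 10 matching elements, and `collect_by_index` is an illustrative helper that mimics the question's loop; neither comes from the original code.

```python
# Hypothetical stand-in for the list returned by soup.findAll(...):
# assume this page contains only 7 matching elements.
matches = ["50", "50", "40", "50", "30", "50", "40"]


def collect_by_index(items, count=10):
    """Mimic the question's loop: index a fixed number of times."""
    collected = []
    for i in range(count):
        # Raises IndexError as soon as i >= len(items).
        collected.append(items[i])
    return collected


try:
    collect_by_index(matches)
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)
```

The loop works for indexes 0 to 6 and fails at index 7, which is the same `list index out of range` message shown in the traceback.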
50
50
50
50
50
50
40
40
40
50
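That said, looping by a fixed index range is not the idiomatic way to do this, and it is prone to index errors: findAll returns only as many elements as the page actually contains, so indexing up to 9 fails whenever there are fewer than 10 matches.
What you can do is replace this:
for i in range(0,10):
    rating = soup.findAll("div", {'class': 'rating reviewItemInline'})[i]
with:
for rating in soup.findAll("div", {'class': 'rating reviewItemInline'}):
This also resolves the error on that line, because the loop only visits elements that exist.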
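Note that the question's second loop (`for x in range(0, 10): Rating = altarray[x]`) indexes a fixed range as well, so it can raise the same error once a page yields fewer than 10 ratings. A minimal sketch of the same idea applied to the writing step follows. It assumes the ratings have already been collected into a list; the `write_ratings` helper, the sample values, and the use of the standard `csv` module with an in-memory buffer are illustrative assumptions, not part of the original code.

```python
import csv
import io


def write_ratings(writer, organization, ratings):
    """Write one CSV row per rating, however many were collected."""
    for rating in ratings:
        writer.writerow([organization, str(rating)])
    return len(ratings)


# In-memory buffer used here instead of a file on disk (assumption for the demo).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Organization", "Rating"])

# Hypothetical sample: a page that yielded only three ratings.
rows_written = write_ratings(writer, "Hilton Athens", ["50", "40", "50"])
print(buffer.getvalue())
```

Iterating over the list itself means a short page simply produces fewer rows instead of an exception. Starting from an empty list (`[]`) rather than an empty string also removes the need for the `len(altarray) == 0` special case.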