Scraping with Python

Date: 2017-03-24 12:15:27

Tags: python web-scraping

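I have some Python code that scrapes data (the ratings from reviews) from TripAdvisor. The problem is that every run gives me a different number of rows, and it never scrapes all of the pages.

The IndexError that comes up is: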

Traceback (most recent call last):
  File "C:/Users/thimios/PycharmProjects/TripadvisorScrapping/proxiro.py", line 26, in <module>
    rating = soup.findAll("div", {'class': 'rating reviewItemInline'})[i]
IndexError: list index out of range

The code is as follows:

from bs4 import BeautifulSoup
import os
import urllib.request

file2 = open(os.path.expanduser(r"~/Desktop/TripAdviser Reviews2.csv"), "wb")        
file2.write(b"Organization,Rating" + b"\n")

WebSites = [
"https://www.tripadvisor.com/Hotel_Review-g189400-d198932-Reviews-Hilton_Athens-Athens_Attica.html#REVIEWS"]

Checker ="REVIEWS"

# looping through each site until it hits a break
for theurl in WebSites:
    thepage = urllib.request.urlopen(theurl)
    soup = BeautifulSoup(thepage, "html.parser")
    #print(soup)

    while True:
        # Extract ratings from the text reviews
        altarray = ""
        for i in range(0,10):
            rating = soup.findAll("div", {'class': 'rating reviewItemInline'})[i]
            rating1 = rating.find_all("span")[0]
            rating2 = rating1['class'][1][-2:]
            print(rating2)
            if len(altarray) == 0:
                altarray = [rating2]
            else:
                altarray.append(rating2)

            #print(altarray)
            #print(len(altarray))
            #print(type(altarray))

            # Extract Organization,
            Organization1 = soup.find(attrs={'class': 'heading_name'})
            Organization = Organization1.text.replace('"', ' ').replace('Review of',' ').strip()
            #print(Organization)



            # Loop through each review on the page
            for x in range(0, 10):
                Rating = altarray[x]
                Rating = str(Rating)
                #print(Rating)
                #print(type(Rating))

                Record2 = Organization + "," + Rating
                if Checker == "REVIEWS":
                    file2.write(bytes(Record2, encoding="ascii", errors='ignore') + b"\n")

                link = soup.find_all(attrs={"class": "nav next rndBtn ui_button primary taLnk"})
                #print(link)
                #print(link[0])
                if len(link) == 0:
                    break
                else:
                   soup = BeautifulSoup(urllib.request.urlopen("http://www.tripadvisor.com" + link[0].get('href')),"html.parser")
                   #print(soup)
                   #print(Organization)
                   print(link[0].get('href'))
                   Checker = link[0].get('href')[-7:]
                   #print(Checker)

        file2.close()

I suspect TripAdvisor does not allow full access to the data. Any ideas?

1 answer:

Answer 0: (score: 0)

You get this error when you try to access a list element by index and that index does not exist.
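To make the failure concrete, here is a minimal sketch that uses a plain list as a stand-in for the result of `findAll`. The list contents and the names `found`, `collected`, and `error_message` are made up for illustration. The point is only that a hard-coded `range(0, 10)` assumes ten matches, so a page with fewer of them raises `IndexError`.

```python
# A stand-in for the list returned by soup.findAll(...); a real page may
# contain fewer than ten matching elements (these values are hypothetical).
found = ["50", "50", "40"]

collected = []
error_message = None
try:
    for i in range(0, 10):          # assumes there are exactly ten results
        collected.append(found[i])  # raises once i reaches len(found)
except IndexError as exc:
    error_message = str(exc)

print(collected)      # the items that were read before the failure
print(error_message)  # the message carried by the IndexError
```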

I ran your code and it printed:

50
50
50
50
50
50
40
40
40
50

That said, looping by index like this is not the idiomatic way to do it, and it is prone to index errors.

What you can do is replace this:

for i in range(0,10):
    rating = soup.findAll("div", {'class': 'rating reviewItemInline'})[i]

with:

for rating in soup.findAll("div", {'class': 'rating reviewItemInline'}):

This will also fix the error.
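As a rough sketch of that replacement in context, the snippet below parses a small inline HTML string instead of fetching a live page. The markup, including the `ui_bubble_rating bubble_50` class names, is an assumption made for illustration and may not match what TripAdvisor actually serves. The extraction step mirrors the `['class'][1][-2:]` expression from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a page that holds only three reviews.
html = """
<div class="rating reviewItemInline"><span class="ui_bubble_rating bubble_50"></span></div>
<div class="rating reviewItemInline"><span class="ui_bubble_rating bubble_40"></span></div>
<div class="rating reviewItemInline"><span class="ui_bubble_rating bubble_50"></span></div>
"""

soup = BeautifulSoup(html, "html.parser")

ratings = []
# Iterate over whatever was found; nothing here assumes a fixed count.
for rating in soup.find_all("div", {"class": "rating reviewItemInline"}):
    span = rating.find("span")
    if span is None:
        continue  # skip a block that has no <span> inside it
    classes = span.get("class", [])
    if len(classes) < 2:
        continue  # skip a <span> without the expected second class name
    # Last two characters of the second class name, e.g. "bubble_50" -> "50".
    ratings.append(classes[1][-2:])

print(ratings)
```

Because the loop visits only the elements that exist, a page with three, seven, or ten reviews is handled the same way, and the two guards skip unexpected markup instead of raising.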