Help with pagination when web scraping with Selenium

Date: 2020-04-11 21:41:59

Tags: python python-3.x selenium web-scraping

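I am new to Selenium and Python. I am trying to paginate through the pages of different hotels to scrape the reviewers' names and the review ratings. I wrote the script below; it works for a single page, but it breaks when I add the pagination code, and I am not sure what the problem is. Thanks in advance.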

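import re
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

# chromedriver is assumed to hold the path to the ChromeDriver executable
driver = webdriver.Chrome(chromedriver)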
driver.get("https://www.tripadvisor.com/Hotels-g60763-New_York_City_New_York-Hotels.html")
driver.maximize_window()
driver.implicitly_wait(10)
soup = BeautifulSoup(driver.page_source, 'html.parser')


headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
list_rating = []
list_users = []
domain = "https://www.tripadvisor.com/"
list_urls = [domain + i.attrs['href'] for i in soup.findAll('a',class_="review_count")]

for i in list_urls:
    find_numbers = re.findall(r'[0-9]+', i)
    find_name_hotel = re.findall('Reviews-.*', i)
    for u in range(0, 10, 5):
        url = i[:50] + find_numbers[1] + '-Reviews-' + 'or' + str(u) + find_name_hotel[0][7:]

        driver.get(url)
        time.sleep(5)
        element_list = driver.find_elements_by_xpath("//span[@class='taLnk ulBlueLinks']")
        for e in element_list:
            try:
                e.click()
            except:
                pass

        # The code above works, but when I add the code below it breaks
        html = driver.page_source
        response = requests.get(url, headers=headers, verify=False).text
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for r in soup.find_all('div', 'reviewSelector'):
            rating = int(r.find('span', 'ui_bubble_rating')['class'][1].split('_')[1]) / 10
            list_rating.append(rating)
        users = driver.find_elements_by_xpath("//a[@class='ui_header_link social-member-event-MemberEventOnObjectBlock__member--35-jC']")
        for i in users:
            list_users.append(i.text)

print(list_rating)
print(list_users)

This is the error I get.

<ipython-input-5-39d08f981b2e> in <module>
      9     find_name_hotel = re.findall('Reviews-.*', i)
     10     for u in range(0,10,5):
---> 11             url = i[:50] + find_numbers[1] + '-Reviews-' + 'or' + str(u) + find_name_hotel[0][7:]
     12 
     13 

TypeError: 'WebElement' object is not subscriptable

1 answer:

Answer 0 (score: 4)

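You are overwriting the variable i in the block of code below. The outer loop uses i for the hotel URL (a string), but this inner loop rebinds i to a WebElement. On the next pass of the for u loop, i[:50] is therefore applied to a WebElement instead of a string, which raises the TypeError: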

for i in users:
    list_users.append(i.text)
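
The failure can be reproduced without Selenium. In the minimal sketch below, FakeElement is a hypothetical stand-in for a WebElement (it has a text attribute but does not support indexing), and the URL is a placeholder. The loop structure mirrors the question: the innermost loop reuses i, so the second pass of the pagination loop tries to slice an element rather than the URL string.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement: has .text, no indexing."""

    def __init__(self, text):
        self.text = text


def reproduce():
    list_urls = ["https://example.com/hotel-a"]  # placeholder URL
    users = [FakeElement("alice"), FakeElement("bob")]
    errors = []
    for i in list_urls:            # i is a URL string here
        for u in range(0, 10, 5):  # two "pages": offsets 0 and 5
            try:
                prefix = i[:20]    # fine on the first pass, fails on the second
            except TypeError as exc:
                errors.append(str(exc))
            for i in users:        # rebinds i to a FakeElement
                pass
    return errors


errors = reproduce()
print(errors)
```

The first pass succeeds because i is still the string; only after the inner loop has run once does the slice fail, which is why the script worked for a single page.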

You can avoid this kind of error by using descriptive variable names instead of i:

for user in users:
    list_users.append(user.text)
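
The same idea can be applied to the outer loops. The sketch below is one possible way to do that, not the asker's code: review_page_url and collect_page_urls are hypothetical helpers, the hotel URL (including the d12345 id) is made up for illustration, and the -Reviews-or<offset>- pattern is taken from the URL the question's script builds. Each loop variable gets its own name (hotel_url, offset), so nothing is rebound by accident.

```python
import re


def review_page_url(hotel_url, offset):
    """Insert an 'or<offset>-' segment after the first '-Reviews-' in the URL."""
    return re.sub(r"-Reviews-", f"-Reviews-or{offset}-", hotel_url, count=1)


def collect_page_urls(hotel_urls, offsets=range(0, 10, 5)):
    """Build one URL per (hotel, offset) pair using distinct loop names."""
    page_urls = []
    for hotel_url in hotel_urls:
        for offset in offsets:
            page_urls.append(review_page_url(hotel_url, offset))
    return page_urls


# Hypothetical hotel URL, used only to exercise the helpers.
example_url = (
    "https://www.tripadvisor.com/"
    "Hotel_Review-g60763-d12345-Reviews-Example_Hotel-New_York_City_New_York.html"
)
for page_url in collect_page_urls([example_url]):
    print(page_url)
```

Building the URL with a substitution instead of a fixed slice such as i[:50] also avoids depending on the exact length of the URL prefix.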