How can I make web scraping with Selenium and Beautiful Soup in Python faster?

Asked: 2020-04-24 19:45:16

Tags: python selenium selenium-webdriver web-scraping beautifulsoup

I wrote a script that scrapes the Vivino website using the Beautiful Soup and Selenium libraries.

From this website, I want to store the review information for certain wines.

I have to use Selenium for dynamic scraping, because the reviews can only be accessed by pressing the "Show more reviews" button on the page, which only appears after scrolling down to the bottom of the page.

I ran the code for just one wine, so you can see how long it takes:

import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd


def scroll_to_bottom_wine_page(driver):

    #driver = self.browser
    scroll_pause_time = 0.01 #Change time?
    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(scroll_pause_time)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height


def scroll_to_bottom_review_page(driver, rating_count):

    stuck_counter = 0
    current_reviews_now = 0
    current_reviews_previous = 0
    scroll_review_pause_time = 0.8 #Change time?
    stop_indicator = rating_count 

    time.sleep(scroll_review_pause_time)
    element_inside_popup = driver.find_element_by_xpath('//*[@id="baseModal"]/div/div[2]/div[3]//a')  #Reviews path



    while True:
        time.sleep(scroll_review_pause_time)
        element_inside_popup.send_keys(Keys.END)
        results_temp = driver.execute_script("return document.documentElement.outerHTML")
        soup = BeautifulSoup(results_temp, 'lxml')    
        reviews = soup.findAll("div", {"class": "card__card--2R5Wh reviewCard__reviewCard--pAEnA"})
        current_reviews_now = len(reviews)

        #In case there actually are less reviews than what the rating_count states, we avoid scrolling down forever
        if(current_reviews_now == current_reviews_previous):
            stuck_counter += 1

        if (current_reviews_now > (stop_indicator)) or (stuck_counter > 2):
            break

        current_reviews_previous = current_reviews_now

    return reviews



def get_reviews(wine_ids, wine_urls, rating_counts):

    #Create a dataframe
    review_info = pd.DataFrame()

    #Create a driver
    driver = webdriver.Chrome()

    for wine_url in wine_urls:

        #Pass URL to driver
        driver.get(wine_url)

        #We scroll down to the bottom of the wine webpage
        scroll_to_bottom_wine_page(driver)

        #Search for the "Show more reviews" button and click it
        wait = WebDriverWait(driver,40)
        wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'Show more reviews')))
        more_reviews_button = driver.find_element_by_link_text('Show more reviews')
        more_reviews_button.click()

        #Scroll till we reach the number of reviews 
        reviews = scroll_to_bottom_review_page(driver, rating_counts)
        length = len(reviews)

        wine_ids_list = [wine_ids] * length
        review_user_links = []
        review_ratings = []
        review_usernames = []
        review_dates = []
        review_texts = []
        review_likes_count = []
        review_comments_count = []

        for review in reviews:


            review_user_links.append([a['href'] for a in review.find_all('a', href=True)][0])
            review_ratings.append(float((review.find("div", class_="rating__rating--ZZb_x")["aria-label"]).split()[1]))
            review_usernames.append(str((review.find('a', {"class" : 'anchor__anchor--3DOSm reviewCard__userName--2KnRl'})).string))
            review_dates.append("".join(((review.find('div', {"class" : 'reviewCard__ratingsText--1LU2T'})).text).rsplit((str(review_usernames[-1])))))

            if (review.find('p', {"class" : 'reviewCard__reviewNote--fbIdd'})) is not None:
                review_texts.append(str((review.find('p', {"class" : 'reviewCard__reviewNote--fbIdd'})).string))
                review_texts = [item.strip() for item in review_texts]  
            else:
                review_texts.append('None')

            if (review.find("div", class_="likeButton__likeCount--82au4")) is not None:
                review_likes_count.append(int(review.find("div", class_="likeButton__likeCount--82au4").text))
            else:
                review_likes_count.append(int(0))

            if (review.find("div", class_="commentsButton__commentsCount--1_Ugm")) is not None:
                review_comments_count.append(int(review.find("div", class_="commentsButton__commentsCount--1_Ugm").text))
            else:
                review_comments_count.append(int(0))

        #We put the information in a dataframe
        review_info_temp = pd.DataFrame()

        review_info_temp.loc[:,'wine_id'] = wine_ids_list
        review_info_temp.loc[:,'review_user_links'] = review_user_links
        review_info_temp.loc[:,'review_ratings'] = review_ratings
        review_info_temp.loc[:,'review_usernames'] = review_usernames
        review_info_temp.loc[:,'review_dates'] = review_dates
        review_info_temp.loc[:,'review_texts'] = review_texts
        review_info_temp.loc[:,'review_likes_count'] = review_likes_count
        review_info_temp.loc[:,'review_comments_count'] = review_comments_count

        #We update the total dataframe
        review_info = pd.concat([review_info,review_info_temp], axis=0, ignore_index=True)

    #We close the driver
    driver.quit()

    return review_info


wine_id = ['123']
wine_url = ['https://www.vivino.com/vinilourenco-pai-horacio-grande-reserva/w/5154081?year=2015&price_id=21118981']
wine_rating_count = 186 

start_time = time.time()
reviews_info = get_reviews(wine_id, wine_url, wine_rating_count)
elapsed_time = time.time() - start_time
print('The scrape took: ', elapsed_time) #For this particular wine, the code took 38 seconds to run

The script I wrote performs the following steps:

1) With a specific wine link (i.e. https://www.vivino.com/vinilourenco-pai-horacio-grande-reserva/w/5154081?year=2015&price_id=21118981), I can access the webpage with the Selenium driver.

2) Then, I scroll down to the bottom of the webpage.

3) I find and click the "Show more reviews" button.

4) After pressing this button, a popup with the wine reviews appears.

5) I scroll down inside this popup until a certain number of reviews is reached.

6) I extract the information I need from the reviews (each review is a Beautiful Soup soup object).

The problem is that if I want to scrape the review information for thousands of wines, it would take forever. For a single wine with 99 reviews, this process takes 35 seconds.

Is there any way to speed up this process?

2 Answers:

Answer 0 (score: 0):

The reviews come from their API:

import requests
agent = {"User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36'}
response = requests.get('https://www.vivino.com/api/wines/5154081/reviews?year=2015&per_page=100', headers=agent)
reviews = response.json()["reviews"]
print(reviews)
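
If you want all of the reviews rather than just the first 100, the same endpoint also takes a page parameter (it appears in the URL shown in the other answer). Below is a minimal sketch of paging through it; the stop condition assumes the API returns an empty "reviews" list once you go past the last page, which is worth verifying:

import requests

agent = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36'}

all_reviews = []
page = 1
while True:
    url = f'https://www.vivino.com/api/wines/5154081/reviews?year=2015&per_page=100&page={page}'
    # Fetch one page of reviews from the API
    response = requests.get(url, headers=agent)
    reviews = response.json()["reviews"]
    if not reviews:  # Assumption: an empty list means there are no more pages
        break
    all_reviews.extend(reviews)
    page += 1

print(len(all_reviews))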

Answer 1 (score: 0):

My suggestion is not to use Selenium. Selenium should be your last resort for scraping a web page. Instead, learn how the web page makes its requests by using your web browser's developer tools. For example, for the page you posted, this is the URL from which you can retrieve the reviews: https://www.vivino.com/api/wines/5154081/reviews?year=2015&per_page=10&page=1

They have an API!! That makes it very easy to scrape.

You only need requests, not even BeautifulSoup.

The answer is as follows:

import requests

headers = {"pragma": "no-cache",
           "sec-fetch-dest": "empty",
           "sec-fetch-mode": "cors",
           "sec-fetch-site": "same-origin",
           "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36",
           "x-requested-with": "XMLHttpRequest"}

url = "https://www.vivino.com/api/wines/5154081/reviews?year=2015&per_page=10&page=1"

resp = requests.get(url, headers=headers)
resp.json()
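
To get from the raw JSON to a DataFrame like the one built in the question, you can flatten each review dict with pandas. The exact keys inside a review object are not shown here, so inspect them first instead of hard-coding column names; a small sketch, assuming the request succeeds and returns at least one review:

import requests
import pandas as pd

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36",
           "x-requested-with": "XMLHttpRequest"}

url = "https://www.vivino.com/api/wines/5154081/reviews?year=2015&per_page=10&page=1"
reviews = requests.get(url, headers=headers).json()["reviews"]

# Inspect the available fields before deciding which columns to keep
print(reviews[0].keys())

# json_normalize flattens nested review dicts into dotted column names
df = pd.json_normalize(reviews)
print(df.columns.tolist())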