Web scraping with Selenium and BeautifulSoup does not pick up newly loaded content after scrolling

Date: 2018-12-05 10:47:41

Tags: python selenium-webdriver web-scraping beautifulsoup selenium-chromedriver

I'm trying to scrape the reviews of some games on Steam. The review page only shows 10 reviews unless you scroll to the bottom, at which point more are loaded. I scroll with Selenium, but the BeautifulSoup object (which I expect to contain 20 reviews) still only has 10. Here is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome(r'E:\Download\chromedriver.exe')
driver.get('https://steamcommunity.com/app/466560/reviews/?browsefilter=toprated&snr=1_5_100010_')
SCROLL_PAUSE_TIME = 0.5
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)
soup = BeautifulSoup(driver.page_source, 'html.parser')

How can I fix this?

2 answers:

Answer 0 (score: 1)

You need to wait until the element with ID `action_wait` is no longer visible. To stop, you can either look for the text that appears when there are no more reviews, or simply set a maximum number of reviews you want.

This example caps the results at 100; you can increase that, but if you don't want to wait any longer, press Ctrl + C and the data collected so far will be processed with BeautifulSoup.

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get('https://.....')
maxResult = 100
currentResults = 0
pageSource = ''

try:
    print('press "Ctrl + C" to stop loop and process using beautfulsoup.')
    while currentResults < maxResult:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # wait until the AJAX loading indicator (#action_wait) disappears
        WebDriverWait(driver, 10).until(EC.invisibility_of_element_located((By.ID, "action_wait")))
        currentResults = len(driver.find_elements_by_css_selector('.apphub_Card.modalContentLink.interactable'))
        print('currentResults: %s' % currentResults)
        pageSource = driver.page_source
except KeyboardInterrupt:
    print("Cancelled by user")
except Exception:
    pass

soup = BeautifulSoup(pageSource, 'html.parser')

reviews = soup.select('.apphub_Card.modalContentLink.interactable')

print('reviews count by BeautifulSoup: %s' % len(reviews))
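The counting step can be sanity-checked offline by feeding BeautifulSoup a small hand-written snippet instead of `pageSource`. The fixture below is invented; only the class names come from the answer:

```python
from bs4 import BeautifulSoup

# Made-up fixture mimicking the class names of Steam's review cards
html = """
<div class="apphub_Card modalContentLink interactable">review 1</div>
<div class="apphub_Card modalContentLink interactable">review 2</div>
<div class="apphub_Card modalContentLink interactable">review 3</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# a compound class selector requires all three classes on the element
reviews = soup.select('.apphub_Card.modalContentLink.interactable')
print('reviews count by BeautifulSoup: %s' % len(reviews))  # 3
```

The same `select` call then works unchanged on the real captured page source.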

Answer 1 (score: 1)

The page updates via jQuery, loading 10 records per scroll; each offset fetches the next batch. A message is shown when the list is exhausted, which you can use to detect that you have scrolled to the end. If you want to stop at a particular point, make the loop's exit condition `len(d.find_elements_by_css_selector('.reviewInfo'))` reaching the number of reviews you want.
from selenium import webdriver

d  = webdriver.Chrome()
url = 'https://steamcommunity.com/app/466560/reviews/?browsefilter=toprated&snr=1_5_100010_'
d.get(url)

while d.find_element_by_css_selector('.apphub_NoMoreContentText1').text != 'No more content. So sad.':
    d.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # click the "show more" button if it is present (loading stalled)
        d.find_element_by_id('GetMoreContentBtn').click()
    except Exception:
        pass
print(len(d.find_elements_by_css_selector('.reviewInfo')))  #6135
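As a sketch, the loop's core idea (keep scrolling until a sentinel element reports that no more content exists) can be factored into a helper and exercised without a browser by passing a stub object whose method names mirror the legacy Selenium calls used above. The stub and helper are illustrative, not part of Selenium:

```python
def scroll_to_end(driver, sentinel='.apphub_NoMoreContentText1',
                  done_text='No more content. So sad.', max_scrolls=100):
    """Scroll until the sentinel element shows done_text (or give up)."""
    for _ in range(max_scrolls):
        if driver.find_element_by_css_selector(sentinel).text == done_text:
            return True
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    return False

class _StubElement:
    def __init__(self, text):
        self.text = text

class _StubDriver:
    """Fake driver that reports 'no more content' after 3 scrolls."""
    def __init__(self):
        self.scrolls = 0
    def execute_script(self, script):
        self.scrolls += 1
    def find_element_by_css_selector(self, selector):
        done = self.scrolls >= 3
        return _StubElement('No more content. So sad.' if done else 'Loading')

d = _StubDriver()
ended = scroll_to_end(d)
print(ended, d.scrolls)  # True 3
```

With a real `webdriver.Chrome()` instance in place of the stub, `scroll_to_end(d)` performs the same loop as the answer before grabbing `d.page_source`.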