I am trying to scrape the Reviews section of Backcountry.com. The site uses a dynamic "load more" section, i.e. the URL does not change when you load more reviews. I am using Selenium WebDriver to interact with the button that loads more reviews, and BeautifulSoup to scrape the reviews themselves.
I can successfully interact with the "load more" button and load all of the available reviews. I can also scrape the initial reviews that are shown before the button is clicked.
Summary: I can interact with the "load more" button and I can scrape the initial reviews, but I cannot scrape all of the reviews once they have all been loaded.
I have tried changing the HTML tags to see if that makes a difference. I have tried increasing the sleep times in case the scraper did not have enough time to finish its work.
# Imports used by the snippets below
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

# URL and request used for BeautifulSoup
url_filter_bc = 'https://www.backcountry.com/msr-miniworks-ex-ceramic-water-filter?skid=CAS0479-CE-ONSI&ti=U2VhcmNoIFJlc3VsdHM6bXNyOjE6MTE6bXNy'
res_filter_bc = requests.get(url_filter_bc, headers={'User-agent': 'notbot'})

# Function that scrapes the reviews
def scrape_bc(request, website):
    newlist = []
    soup = BeautifulSoup(request.content, 'lxml')
    newsoup = soup.find('div', {'id': 'the-wall'})
    reviews = newsoup.find('section', {'id': 'wall-content'})
    for row in reviews.find_all('section', {'class': 'upc-single user-content-review review'}):
        newdict = {}
        newdict['review'] = row.find('p', {'class': 'user-content__body description'}).text
        newdict['title'] = row.find('h3', {'class': 'user-content__title upc-title'}).text
        newdict['website'] = website
        newlist.append(newdict)
    df = pd.DataFrame(newlist)
    return df
# Function that uses Selenium and combines it with the scraper function to output a pandas DataFrame
def full_bc(url, website):
    # connect_to_page is a helper (defined elsewhere) that returns a Selenium WebDriver
    driver = connect_to_page(url, headless=False)
    request = requests.get(url, headers={'User-agent': 'notbot'})
    time.sleep(5)
    full_df = pd.DataFrame()
    while True:
        try:
            loadMoreButton = driver.find_element_by_xpath("//a[@class='btn js-load-more-btn btn-secondary pdp-wall__load-more-btn']")
            time.sleep(2)
            loadMoreButton.click()
            time.sleep(2)
        except:
            print('Done Loading More')
            # full_json = driver.page_source
            temp_df = pd.DataFrame()
            temp_df = scrape_bc(request, website)
            full_df = pd.concat([full_df, temp_df], ignore_index=True)
            time.sleep(7)
            driver.quit()
            break
    return full_df
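For reference, the whole thing is invoked roughly like this (the 'backcountry' string is just an illustrative label for the website column):

reviews_df = full_bc(url_filter_bc, 'backcountry')
print(reviews_df.shape)  # expecting (113, 3); currently returns (18, 3)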
I expect a pandas DataFrame with 113 rows and three columns. I am getting a pandas DataFrame with 18 rows and three columns.
Answer (score: 0):
Okay, you click loadMoreButton and it loads more reviews. However, you are still feeding scrape_bc the same request content that you downloaded once with requests, entirely separately from Selenium.
Replace requests.get(...) with driver.page_source, and make sure the driver.page_source call is inside the loop, right before the scrape_bc(...) call:
request = driver.page_source
temp_df = pd.DataFrame()
temp_df = scrape_bc(request, website)
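Putting it together, a minimal sketch of the corrected flow could look like the following. It assumes connect_to_page returns a Selenium WebDriver as in the original code, adjusts scrape_bc to accept the page-source string directly (driver.page_source is a str, not a requests Response, so it has no .content attribute), and scrapes once after the loop has finished loading everything:

def scrape_bc(page_source, website):
    # page_source is the HTML string from driver.page_source (not a requests Response)
    soup = BeautifulSoup(page_source, 'lxml')
    wall = soup.find('div', {'id': 'the-wall'}).find('section', {'id': 'wall-content'})
    rows = []
    for row in wall.find_all('section', {'class': 'upc-single user-content-review review'}):
        rows.append({
            'review': row.find('p', {'class': 'user-content__body description'}).text,
            'title': row.find('h3', {'class': 'user-content__title upc-title'}).text,
            'website': website,
        })
    return pd.DataFrame(rows)

def full_bc(url, website):
    driver = connect_to_page(url, headless=False)
    time.sleep(5)
    while True:
        try:
            # keep clicking until the "load more" button can no longer be found
            loadMoreButton = driver.find_element_by_xpath(
                "//a[@class='btn js-load-more-btn btn-secondary pdp-wall__load-more-btn']")
            loadMoreButton.click()
            time.sleep(2)
        except Exception:
            print('Done Loading More')
            break
    # grab the fully loaded page from Selenium, then parse it with BeautifulSoup
    full_df = scrape_bc(driver.page_source, website)
    driver.quit()
    return full_df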