Selenium
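There are five comment lists under the same URL (the site is an SPA); each one is loaded dynamically by clicking a button.
If needed, the site is https://www.icourse163.org/course/PKU-1205962805, but it is in Chinese.
What I expect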
page1:
<page1 comment list>
page2:
<page2 comment list>
page3:
<page3 comment list>
page4:
<page4 comment list>
page5:
<page5 comment list>
What I get
page1:
<page2 comment list>
page2:
<page3 comment list>
page3:
<page3 comment list>
page4:
<page4 comment list>
page5:
<page5 comment list>
Code
import requests
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.webdriver import WebDriver
from selenium.webdriver.chrome.options import Options

'''
API:
dict get_comment(url:string)
return dict of all comments:string
'''
def get_comment(url):
    '''
    div.ux-mooc-comment-course-comment_comment-list_item div.ux-mooc-comment-course-comment_comment-list_item_body_content
    '''
    driver = WebDriver()
    driver.get(url)
    driver.find_element_by_id("review-tag-button").click()
    # 1,2,3,..button
    comment_page_btns = driver.find_elements_by_class_name("th-bk-main-gh")
    page = 1
    file = open("comment.txt","w")
    for btn in comment_page_btns:
        btn.click()
        soup = BeautifulSoup(driver.page_source,"lxml")
        #comment list for one subpage
        comment_tag_list = soup.select("div.ux-mooc-comment-course-comment_comment-list_item div.ux-mooc-comment-course-comment_comment-list_item_body_content")
        comment_count = len(comment_tag_list)
        print("in:",page," comment count: ", comment_count)
        index = "page"+str(page)+"\n"
        file.write(index)
        for tag in comment_tag_list:
            text = tag.get_text().rstrip().lstrip()+"\n"
            file.write(text.encode("utf-8"))
        page = page+1
    file.close()
    driver.quit()

if __name__ == "__main__":
    get_comment("https://www.icourse163.org/course/PKU-1205962805")
Answer 0 (score: 0)
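This may be a bit late. Here is your code with a few changes so that this no longer happens. Basically, you were clicking the next button before reading the comments on the current page. I also changed the code so that BeautifulSoup is no longer needed.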
import time
from selenium.webdriver.chrome.webdriver import WebDriver
from selenium.webdriver.chrome.options import Options

'''
API:
dict get_comment(url:string)
return dict of all comments:string
'''
def get_comment(url):
    '''
    div.ux-mooc-comment-course-comment_comment-list_item div.ux-mooc-comment-course-comment_comment-list_item_body_content
    '''
    driver = WebDriver()
    driver.get(url)
    driver.find_element_by_id("review-tag-button").click()
    # 1,2,3,..button
    page = 1
    file = open("comment.txt","w", encoding = 'utf-8')
    for x in range(15):
        # give the dynamically loaded list time to render
        time.sleep(0.5)
        #comment list for one subpage
        comment_tag_list = driver.find_elements_by_xpath("//div[@class='ux-mooc-comment-course-comment_comment-list_item_body_content']/span")
        comment_count = len(comment_tag_list)
        print("in:",page," comment count: ", comment_count)
        index = "page"+str(page)+"\n"
        file.write(index)
        for tag in comment_tag_list:
            text = tag.text.rstrip().lstrip()+"\n"
            file.write(text)
        page = page+1
        # read first, then advance: '下一页' is the site's "next page" link
        btn = driver.find_element_by_xpath("//a[.='下一页']")
        btn.click()
    file.close()
    driver.quit()

if __name__ == "__main__":
    get_comment("https://www.icourse163.org/course/PKU-1205962805")
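
The fixed time.sleep(0.5) above can still read stale content if the page is slow to update. A minimal sketch of one alternative, assuming the comment list visibly changes after each click: read the current page first, click next, then poll until the list differs from what was just read. The names scrape_pages, read_page, go_next and FakePager are hypothetical and are not part of Selenium or of the code above; FakePager only simulates a page whose content lags behind the click, so the sketch runs without a browser.

```python
import time


def scrape_pages(read_page, go_next, max_pages, timeout=5.0, poll=0.1):
    """Collect the items of each page, reading BEFORE advancing.

    read_page: hypothetical callable returning the list of comment strings
               currently rendered.
    go_next:   hypothetical callable that clicks the "next page" control and
               returns False when there is no next page.
    """
    pages = []
    for _ in range(max_pages):
        current = read_page()
        pages.append(current)
        if not go_next():
            break
        # Poll until the rendered list differs from the one just read.
        # Assumption: two consecutive pages never hold identical comments.
        deadline = time.monotonic() + timeout
        while read_page() == current:
            if time.monotonic() > deadline:
                return pages  # content never changed; give up
            time.sleep(poll)
    return pages


class FakePager:
    """Stand-in for the browser: new content appears a few reads after a click."""

    def __init__(self, pages, lag=2):
        self.pages = pages
        self.shown = 0    # index of the page currently rendered
        self.target = 0   # index requested by the last click
        self.lag = lag
        self.pending = 0  # reads left before the new content shows up

    def read(self):
        if self.pending > 0:
            self.pending -= 1
            if self.pending == 0:
                self.shown = self.target
        return list(self.pages[self.shown])

    def click_next(self):
        if self.target + 1 >= len(self.pages):
            return False
        self.target += 1
        self.pending = self.lag
        return True


if __name__ == "__main__":
    pager = FakePager([["a1", "a2"], ["b1"], ["c1", "c2"]])
    for number, comments in enumerate(
        scrape_pages(pager.read, pager.click_next, max_pages=5, poll=0.01), 1
    ):
        print("page%d:" % number, comments)
```

With a real driver, read_page could wrap the find_elements_by_xpath call from the answer and return the stripped text of each element, and go_next could click the '下一页' link. Selenium's own WebDriverWait is another way to express the same wait.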