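I'm building a web scraper and want to scrape user reviews from IMDb. Getting the first 10 reviews and ratings straight from the original page is easy, for example http://www.imdb.com/title/tt1392170/reviews. The problem is that to get all the reviews I have to press the "Load More" button, which shows more reviews without the URL changing. So I don't know how to get all the reviews in Python 3. I'm currently using requests and bs4.
My current code: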
from urllib.request import urlopen
from bs4 import BeautifulSoup

url_link = 'http://www.imdb.com/title/tt0371746/reviews?ref_=tt_urv'
html = urlopen(url_link)
# Name the parser explicitly instead of letting bs4 pick one
content_bs = BeautifulSoup(html, 'html.parser')

# Review bodies
for b in content_bs.find_all('div', class_='text'):
    print(b)

# Rating given by each reviewer
for rate_score in content_bs.find_all('span', class_='rating-other-user-rating'):
    print(rate_score)
Answer 0 (score: 1)
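You can't press the "Load More" button without triggering a click event, and BeautifulSoup has no way to do that. However, you can still get the full content, as demonstrated below. It extracts all the reviews along with each review title: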
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'http://www.imdb.com/title/tt0371746/reviews?ref_=tt_urv'
res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")

# Extract the link leading to the page containing everything available here
main_content = urljoin(url, soup.select(".load-more-data")[0]['data-ajaxurl'])
response = requests.get(main_content)
broth = BeautifulSoup(response.text, "lxml")

for item in broth.select(".review-container"):
    title = item.select(".title")[0].text
    review = item.select(".text")[0].text
    print("Title: {}\n\nReview: {}\n\n".format(title, review))
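
If the AJAX endpoint turns out to return reviews in batches rather than all at once, the same idea can be repeated page by page. The sketch below is only an illustration under assumptions that the answer does not confirm: that each response carries a `data-key` attribute on the `.load-more-data` element, and that the endpoint accepts that key as a `paginationKey` query parameter. The names `LoadMoreParser`, `with_query_param`, `next_page_url` and `iter_pages` are made up for this example, and it uses only the standard library so the pagination logic can be tried without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit


class LoadMoreParser(HTMLParser):
    """Remembers the attributes of the first tag with class 'load-more-data'."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.attrs is None and "load-more-data" in classes:
            self.attrs = attr_map


def with_query_param(url, name, value):
    """Return url with query parameter `name` set to `value` (replacing any old one)."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    query.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


def next_page_url(page_url, html):
    """Return the URL of the next batch, or None if the page offers no further key."""
    parser = LoadMoreParser()
    parser.feed(html)
    found = parser.attrs
    if not found or not found.get("data-key") or not found.get("data-ajaxurl"):
        return None
    ajax_url = urljoin(page_url, found["data-ajaxurl"])
    # Assumption: the endpoint reads the key from a `paginationKey` parameter.
    return with_query_param(ajax_url, "paginationKey", found["data-key"])


def iter_pages(fetch, start_url, max_pages=50):
    """Yield the HTML of each page, following load-more keys until none is left.

    `fetch` is any callable that maps a URL to HTML text; `max_pages` is a
    safety cap so that a repeating key cannot cause an endless loop.
    """
    url = start_url
    for _ in range(max_pages):
        if url is None:
            break
        html = fetch(url)
        yield html
        url = next_page_url(url, html)
```

With requests, `fetch` could be something like `lambda u: requests.get(u).text`, and each yielded page can then be handed to BeautifulSoup and read with the `.review-container` selector exactly as in the code above. The first page yielded is the regular reviews page itself, so whether its reviews overlap with the first AJAX batch is worth checking before counting results.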