Question

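I'm new to web scraping and not sure whether I need to do something else. I'm trying to scrape a web page with Python and BeautifulSoup to collect all the reviews/user IDs and ratings.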

Link: https://play.google.com/store/apps/details?id=com.amazon.avod.thirdpartyclient&hl=en_US

import requests
from bs4 import BeautifulSoup

source = requests.get('https://play.google.com/store/apps/details?id=com.amazon.avod.thirdpartyclient&hl=en_US').text

soup = BeautifulSoup(source, 'lxml')
review = soup.find('div')
print(review.prettify())

After this line, when I try to find the reviews, it returns None. I'm not sure what I'm doing wrong.

Answer 1

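Before scraping, it is a good idea to view the page source. If the tags you want are not in the page source, you have a few other options. First, look in the script tags: in many cases the data is there, although you will need some string manipulation to extract it. Second, check the requests the page makes while it loads; this can reveal exposed API endpoints that return the data directly. Finally, if the data is in neither a script tag nor an exposed API endpoint, you need to execute the JavaScript with Selenium (another option is a browser with a custom plugin that lets you control it).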
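As a quick way to tell which of these situations you are in, here is a minimal sketch that reports whether a piece of text you can see in the browser appears in the regular markup, only inside a script tag, or nowhere in the downloaded HTML. It uses only the standard library so it runs without extra packages; the names MarkerLocator and locate_marker and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class MarkerLocator(HTMLParser):
    """Record whether a marker string occurs in regular text or inside <script>."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.in_script = False
        self.found_in_markup = False
        self.found_in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Rough check: text may arrive in several chunks, so a marker that
        # straddles two chunks would be missed.
        if self.marker in data:
            if self.in_script:
                self.found_in_script = True
            else:
                self.found_in_markup = True


def locate_marker(html, marker):
    """Return 'markup', 'script' or 'absent' for the given marker."""
    parser = MarkerLocator(marker)
    parser.feed(html)
    parser.close()
    if parser.found_in_markup:
        return "markup"
    if parser.found_in_script:
        return "script"
    return "absent"


# Made-up page: the review text exists only inside a script tag.
sample_html = """
<html><body>
<div id="app"></div>
<script>var data = ["Great app", 5];</script>
</body></html>
"""

print(locate_marker(sample_html, "Great app"))
print(locate_marker(sample_html, "Terrible"))
```

If the result is 'markup', plain BeautifulSoup lookups should work. If it is 'script', the data has to be pulled out of the script text. If it is 'absent', the page probably fetches it with a separate request or builds it in the browser.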

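In your case, the review data is in a script tag. (JavaScript in the browser generates the div tags containing the reviews, but executing that JavaScript requires Selenium, a browser, or a custom-written plugin.) The page has several script tags, but you can identify the one holding the review data by the following string: https://lh3.googleusercontent.com/a- (part of the user link of whoever wrote the review).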

In your case, something like this may work:

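import requests
import json
from bs4 import BeautifulSoup

r = requests.get('https://play.google.com/store/apps/details?id=com.amazon.avod.thirdpartyclient&hl=en_US')
# name the parser explicitly; without one bs4 warns and picks whichever is installed
soup = BeautifulSoup(r.text, 'html.parser')
scripts = soup.find_all('script')
review_script = None
for script in scripts:
    # script.string is None for empty or external scripts, so fall back to ''
    script_text = script.string or ''
    # this URL prefix is part of each reviewer's user link, so it marks the script with the reviews
    if "https://lh3.googleusercontent.com/a-" in script_text:
        review_script = script_text
        break
if review_script is None:
    raise SystemExit('no script tag contains the review marker; the page may have changed')
# the script tag contains a function that returns the data you want;
# keep what follows the last 'return' and precedes '}});' to get a nested list
data = json.loads(review_script.split('return')[-1].split('}});')[0])
# the review data sits in a nested list and needs some unpacking
reviews = data[0]
# modify as needed
for review in reviews:
    user = review[1][0]
    text = review[4]  # separate name so the loop variable is not overwritten
    print(user, text)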
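To try the string handling without hitting the network, here is a minimal offline sketch of the same split-and-parse step. The sample script text and the names extract_payload and iter_reviews are made up for illustration; it only mirrors the indexing used above (user name at [1][0], review text at [4]), and the real script content is much larger and its layout may differ or change.

```python
import json


def extract_payload(script_text):
    """Parse the JSON array between the last 'return' and the closing '}});'."""
    body = script_text.split("return")[-1].split("}});")[0]
    return json.loads(body)


def iter_reviews(payload):
    """Yield (user, text) pairs using the same indexes as the code above."""
    for entry in payload[0]:
        yield entry[1][0], entry[4]


# Hypothetical script content, shaped only to match the indexing above.
sample_script = (
    "someCallback({key: 'ds:0', data: function(){return "
    '[[["id-1",["Alice","https://lh3.googleusercontent.com/a-/abc"],null,null,"Works well"],'
    '["id-2",["Bob","https://lh3.googleusercontent.com/a-/def"],null,null,"Crashes on start"]]]'
    "}});"
)

payload = extract_payload(sample_script)
for user, text in iter_reviews(payload):
    print(user, text)
```

Note that splitting on 'return' is fragile: a review whose text contains that word would break the parse, so treat this as a starting point.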
