Waiting before scraping with BeautifulSoup

Asked: 2021-05-11 22:15:29

Tags: python web-scraping

I am trying to scrape data from this review website. The script first goes through the first page, checks whether there is a second page, and then moves on to it. The problem arises when it reaches the second page: the page takes time to update, and I still get the first page's data instead of the second page's.

For example, if you go here you will see how long it takes for the page 2 data to load.

I have tried setting a timeout and sleeping, without success. I would prefer a solution with minimal package/browser dependencies (such as webdriver.PhantomJS()), because I need to run this code in my employer's environment and I am not sure whether I would be allowed to use them. Thanks!!

from urllib.request import Request, urlopen
from time import sleep
from bs4 import BeautifulSoup

# the review page being scraped (the same URL used in the answers below)
softwareadvice = 'https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/'
hdr = {'User-Agent': 'Mozilla/5.0'}

req = Request(softwareadvice, headers={'User-Agent': 'Mozilla/5.0'})

web_byte = urlopen(req, timeout=10).read()

webpage = web_byte.decode('utf-8')
parsed_html = BeautifulSoup(webpage, features="lxml")

# the right pagination arrow is only present when there is a next page
has_next = parsed_html.find('div', {'class': ['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})

while has_next:
    # request page 2 and sleep, hoping the content has had time to update
    req = Request(softwareadvice + '?review.page=2', headers=hdr)
    sleep(10)
    webpage = urlopen(req, timeout=10)
    sleep(10)
    webpage = webpage.read().decode('utf-8')
    parsed_html = BeautifulSoup(webpage, features="lxml")

    # this still finds the arrow in the page-1 markup, which is the problem described above
    has_next = parsed_html.find('div', {'class': ['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})

2 Answers:

Answer 0: (score: 1)

The reviews are loaded from an external source via an Ajax request. You can use this example to load them:

import re
import json
import requests
from bs4 import BeautifulSoup


url = "https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/"
api_url = (
    "https://pkvwzofxkc.execute-api.us-east-1.amazonaws.com/production/reviews"
)

params = {
    "q": "s*|-s*",
    "facet.gdm_industry_id": '{"sort":"bucket","size":200}',
    "fq": "(and product_id: '{}' listed:1)",
    "q.options": '{"fields":["pros^5","cons^5","advice^5","review^5","review_title^5","vendor_response^5"]}',
    "size": "50",
    "start": "50",
    "sort": "completeness_score desc,date_submitted desc",
}

# get product id
soup = BeautifulSoup(requests.get(url).content, "html.parser")
a = soup.select_one('a[href^="https://reviews.softwareadvice.com/new/"]')
id_ = int("".join(re.findall(r"\d+", a["href"])))

params["fq"] = params["fq"].format(id_)

for start in range(0, 3):  # <-- increase the number of pages here
    params["start"] = 50 * start

    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    # print some data:
    for h in data["hits"]["hit"]:
        if "review" in h["fields"]:
            print(h["fields"]["review"])
            print("-" * 80)

Prints:

After 2 years using Twilio services, mainly phone and messages, I can say I am so happy I found this solution to handle my communications. It is so flexible,  Although it has been a little bit complicated sometimes to self-learn about online phoning systems it saved me from a lot of hassles I wanted to avoid. The best benefit you get is the ultra efficient support service
--------------------------------------------------------------------------------
An amazingly well built product -- we rarely if ever had reliability issues -- the Twilio Functions were an especially useful post-purchase feature discovery -- so much so that we still use that even though we don't do any texting.  We also sometimes use FracTEL, since they beat Twilio on pricing 3:1 for 1-800 texts *and* had MMS 1-800 support long before Twilio. 
--------------------------------------------------------------------------------
I absolutely love using Twilio, have had zero issues in using the SIP and text messaging on the platform.
--------------------------------------------------------------------------------
Authy by Twilio is a run-of-the-mill 2FA app. There's nothing special about it. It works when you're not switching your hardware.
--------------------------------------------------------------------------------
We've had great experience with Twilio. Our users sign up for text notification and we use Twilio to deliver them information. That experience has been well-received by customers. There's more to Twilio than that but texting is what we use it for. The system barely ever goes down and always shows us accurate information of our usage.
--------------------------------------------------------------------------------

...and so on.
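If you prefer not to hard-code the number of pages, a small variation of the loop above (a sketch reusing the same api_url and params dict) can keep increasing start until the API returns no more hits:

start = 0
while True:
    params["start"] = start
    hits = requests.get(api_url, params=params).json()["hits"]["hit"]
    if not hits:  # no more reviews returned, stop paginating
        break
    for h in hits:
        if "review" in h["fields"]:
            print(h["fields"]["review"])
    start += 50  # step matches the "size" parameter defined above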

Answer 1: (score: 1)

I have scraped many kinds of websites, and I think that in the scraping world there are roughly two types of sites.

The first is the "URL-based" site (i.e. you send a request with a URL and the server responds with HTML tags from which the elements can be extracted directly), and the second is the "JavaScript-rendered" site (i.e. the only response you get is JavaScript, and you can only see the HTML tags after it has run).

In the former case you can browse the site freely with bs4. But in the latter case you cannot always just rely on the URL as a rule of thumb.
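A rough way to tell the two apart is to fetch the raw HTML with requests and check whether text you can see in the browser actually appears in the response. The snippet below is a sketch of that heuristic; the probe string is a fragment of one of the reviews printed in the answer above.

import requests

url = "https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# text visible on the rendered page (taken from one of the reviews shown earlier)
snippet = "ultra efficient support service"
print("server-rendered" if snippet in html else "JavaScript-rendered")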

The site you want to scrape is built with Angular.js, which is based on client-side rendering. So the response you get is JavaScript code, not HTML markup containing the page content. You have to run that code to get the content.

About the code you posted:

req = Request(softwareadvice, headers={'User-Agent': 'Mozilla/5.0'})
    
web_byte = urlopen(req, timeout=10).read() # response is javascript, not page content you want...
    
webpage = web_byte.decode('utf-8')

All you can get is JavaScript code that has to be run to obtain the HTML elements. That is why you get the same page (response) every time.

So, what can be done? Is there a way to run JavaScript in bs4? I don't think there is any proper way to do that. You can use Selenium for this: you can literally wait until the page has fully loaded, you can click buttons and anchors, and you can grab the page content at any time.
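A minimal Selenium sketch along these lines (untested against the live site; the ".review-card" selector is hypothetical, the pagination-arrow class is taken from the question's code, and the staleness check assumes the page replaces the review nodes when paginating, so all of these may need adjusting):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/")
wait = WebDriverWait(driver, 20)

# wait until at least one review is rendered (".review-card" is a hypothetical selector)
first_review = wait.until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.review-card"))
)
page_1 = BeautifulSoup(driver.page_source, "lxml")

# click the right pagination arrow (class name taken from the question's code)
arrow = wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "div.pagination-arrows-right"))
)
arrow.click()

# wait until the old review element is gone, i.e. page 2 has actually rendered
wait.until(EC.staleness_of(first_review))
page_2 = BeautifulSoup(driver.page_source, "lxml")

driver.quit()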

A headless browser in Selenium would probably work, meaning you don't have to watch a controlled browser window open on your machine.
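To make the sketch above headless (assuming a recent Chrome and Selenium 4), the driver can be created with Chrome options, for example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # use "--headless" on older Chrome versions
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)  # then continue as in the example above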

Here are some links that might help you.

scrape html generated by javascript with python

https://sadesmith.com/2018/06/15/blog/scraping-client-side-rendered-data-with-python-and-selenium

Thanks for reading.