在Patreon上使用bs4进行Python网络抓取

时间:2020-07-18 15:21:34

标签: python html selenium web-scraping beautifulsoup

我编写了一个脚本,该脚本可以查找一些博客,并查看是否已添加新帖子。但是,当我尝试在Patreon上执行此操作时,我无法使用bs4找到正确的元素。

让我们以https://www.patreon.com/cubecoders为例。

说我想在“成为赞助人”部分下获得独家帖子的数量,截止目前为25。

此代码可以正常工作:

import requests
from bs4 import BeautifulSoup

plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("div", class_="sc-AxjAm fXpRSH").text
print(text_of_newest_post)

Output: 25

现在,我想获取最新帖子的标题,即“ AMP 2.0.2中的新增功能-集成SCP / SFTP服务器!”截至目前。 我在浏览器中检查了标题,并发现该标题包含在带有“ sc-1di2uql-1 vYcWR”类的span标记中。

但是,当我尝试运行此代码时,无法获取该元素:

import requests
from bs4 import BeautifulSoup

plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("span", class_="sc-1di2uql-1 vYcWR")
print(text_of_newest_post)

Output: None

我已经尝试使用XPath或CSS选择器来获取元素,但是无法做到。我认为可能是因为该网站首先使用JavaScript渲染,因此在正确渲染元素之前我无法访问它们。 当我先使用Selenium渲染网站时,在页面上打印所有div标签时都可以看到标题,但是当我只想获得第一个标题时,就无法访​​问它。

你们知道解决方法吗? 预先感谢!

编辑: 在Selenium中,我可以这样做:

from selenium import webdriver
browser = webdriver.Chrome("C:\webdrivers\chromedriver.exe")
browser.get("https://www.patreon.com/cubecoders")
divs = browser.find_elements_by_tag_name("div")


def find_text(divs):
    for div in divs:
        for span in div.find_elements_by_tag_name("span"):
            if span.get_attribute("class") == "sc-1di2uql-1 vYcWR":
                return span.text

            
print(find_text(divs))
browser.close()

Output: New in AMP 2.0.2 - Integrated SCP/SFTP server!

当我从头开始尝试使用类'sc-1di2uql-1 vYcWR'搜索跨度时,它不会给我结果。难道find_elements方法在嵌套标记中看起来不会更深入?

1 个答案:

答案 0 :(得分:0)

您看到的数据是通过Ajax从其API加载的。您可以使用requests模块来加载数据。

例如:

import re
import json
import requests
from bs4 import BeautifulSoup


url = 'https://www.patreon.com/cubecoders'
api_url = 'https://www.patreon.com/api/posts'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': url
}


with requests.session() as s:
    html_text = s.get(url, headers=headers).text
    campaign_id = re.search(r'https://www\.patreon\.com/api/campaigns/(\d+)', html_text).group(1)
    data = s.get(api_url, headers=headers, params={'filter[campaign_id]': campaign_id, 'filter[contains_exclusive_posts]': 'true', 'sort': '-published_at'}).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    # print some information to screen:
    for d in data['data']:
        print('{:<70} {}'.format(d['attributes']['title'], d['attributes']['published_at']))

打印:

New in AMP 2.0.2 - Integrated SCP/SFTP server!                         2020-07-17T13:28:49.000+00:00
AMP Enterprise Pricing Reveal!                                         2020-07-07T10:02:02.000+00:00
AMP Enterprise Edition Waiting List                                    2020-07-03T13:25:35.000+00:00
Upcoming changes to the user system                                    2020-05-29T10:53:43.000+00:00
More video tutorials! What do you want to see?                         2020-05-21T12:20:53.000+00:00
Third AMP tutorial - Windows installation!                             2020-05-21T12:19:23.000+00:00
Another day, another video tutorial!                                   2020-05-08T22:56:45.000+00:00
AMP Video Tutorial - Out takes!                                        2020-05-05T23:01:57.000+00:00
AMP Video Tutorials - Installing AMP on Linux                          2020-05-05T23:01:46.000+00:00
What is the AMP Console Assistant (AMPCA), and why does it exist?      2020-05-04T01:14:39.000+00:00
Well that was unexpected...                                            2020-05-01T11:21:09.000+00:00
New Goal - MariaDB/MySQL Support!                                      2020-04-22T13:41:51.000+00:00
Testing out AMP Enterprise Features                                    2020-03-31T18:55:42.000+00:00
Temporary feature unlock for all Patreon backers!                      2020-03-11T14:53:31.000+00:00
Preparing for Enterprise                                               2020-03-11T13:09:40.000+00:00
Aarch64/ARM64 and Raspberry Pi is here!                                2020-03-06T19:07:09.000+00:00
Aarch64/ARM64 and Raspberry Pi progress!                               2020-02-26T17:53:53.000+00:00
Wallpaper!                                                             2020-02-13T11:04:39.000+00:00
Instance Templating - Make once, deploy many.                          2020-02-06T15:26:09.000+00:00
Time for a new module!                                                 2020-01-07T13:41:17.000+00:00