Question

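I'm trying to use BeautifulSoup to extract headlines from the homepages of several news sites. I'm learning Python but don't know much about HTML, JavaScript, or CSS, so I got this far by trial and error using Inspect in Chrome. Here is the code I wrote for the New York Times homepage: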

import requests
from bs4 import BeautifulSoup


url = "https://www.nytimes.com/"
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, features="html.parser")
headlines = soup.find_all(class_="css-1vynn0q esl82me3")

for item in headlines:
    if len(item.contents) == 1:
        print(item.text)
    elif len(item.contents) == 2:
        print(item.contents[1].text)

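Here are my questions:

Since I plan to do this for several news sites, is there a better approach you would suggest?
I noticed that the CSS class names have changed since I wrote this code, so I had to update it. Is there a solution that doesn't require changing the code every time the class names are updated?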

Answer 1

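Possibly, since you can find the <script> tag in the HTML and then parse its contents as JSON. Not every news site will necessarily use that tag, and each is likely to use different tags or code to identify its headlines, but you can end up with generic working code that pulls out those headlines even if the page is updated later.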

Parse the HTML as you normally would:

import requests 
from bs4 import BeautifulSoup
import json

url = "https://www.nytimes.com/"
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, "html.parser")

Then find all the <script> tags. The one we want starts with the text window.__preloadedData =, so we only need to search for it among the 14 <script> elements that were found:

scripts = soup.find_all('script')
for script in scripts:
    if 'preloadedData' in script.text:
        jsonStr = script.text
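
The loop above keeps whichever matching script it saw last and leaves jsonStr undefined when nothing matches. Here is a minimal sketch of a stricter selection. The helper name find_preloaded and the sample strings are hypothetical, and it assumes the script bodies have already been pulled out as plain strings, for example with [s.text for s in scripts]:

```python
def find_preloaded(script_texts, marker="window.__preloadedData"):
    """Return the first script body that starts with the marker, or None."""
    for text in script_texts:
        if text.lstrip().startswith(marker):
            return text
    return None


# Made-up sample data standing in for the text of each <script> tag.
sample = [
    "var analytics = {};",
    'window.__preloadedData = {"initialState": {}};',
]
print(find_preloaded(sample))
print(find_preloaded(["var x = 1;"]))
```

Returning None lets the caller decide what to do when a site has no such script, instead of failing later with a NameError.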

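Once it is found, we store it as jsonStr, then trim the beginning and end of the string so that only pure JSON remains, which can then be read with json.loads() and stored as our jsonObj:

jsonStr = jsonStr.split('=', 1)[1].strip()
jsonStr = jsonStr.rsplit(';', 1)[0]
jsonObj = json.loads(jsonStr)

Once we have jsonObj, we iterate over the key: value pairs in the structure to find the values associated with the headline key in the JSON object:

for ele, v in jsonObj['initialState'].items():
    try:
        if v['headline']:
            print(v['headline'])
    except (KeyError, TypeError):
        # Skip entries that are not dicts or have no headline key.
        continue

Full code:

I also added a datetime element, since you may want to store it to see which headlines were shown at a particular date/time.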

import datetime
import json

import requests
from bs4 import BeautifulSoup

url = "https://www.nytimes.com/"
r = requests.get(url)
# Timestamp of the request, useful for comparing headlines over time.
now = datetime.datetime.now()
now = now.strftime('%A, %B %d, %Y  %I:%M %p')

r_html = r.text
soup = BeautifulSoup(r_html, "html.parser")

# Find the <script> that holds window.__preloadedData and parse its JSON payload.
jsonObj = None
scripts = soup.find_all('script')
for script in scripts:
    if 'preloadedData' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('=', 1)[1].strip()
        jsonStr = jsonStr.rsplit(';', 1)[0]
        jsonObj = json.loads(jsonStr)
        break

if jsonObj is None:
    raise SystemExit('No preloadedData script found at %s' % url)

print('%s\nHeadlines\n%s\n' % (url, now))
count = 1
for ele, v in jsonObj['initialState'].items():
    try:
        if v['headline'] and v['__typename'] == 'PromotionalProperties':
            print('Headline %s: %s' % (count, v['headline']))
            count += 1
    except (KeyError, TypeError):
        # Skip entries that are not dicts or lack these keys.
        continue
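
The split('=') / rsplit(';') trimming above depends on exactly where the trailing semicolon sits, and the loop over initialState only looks one level deep. As a rough sketch of an alternative, the standard library's json.JSONDecoder.raw_decode can parse the first JSON value after the = sign and ignore whatever follows, and a small recursive walk can collect headline values wherever they are nested. The function names and the sample string here are made up for illustration. Real pages may nest headlines differently or embed values that are not valid JSON, so treat this as a starting point rather than a drop-in replacement:

```python
import json


def parse_preloaded(script_text):
    """Parse the JSON value assigned in 'window.__preloadedData = {...};'."""
    _, _, tail = script_text.partition("=")
    # raw_decode parses one JSON value and ignores anything after it.
    obj, _end = json.JSONDecoder().raw_decode(tail.lstrip())
    return obj


def collect_headlines(node):
    """Recursively gather non-empty strings stored under a 'headline' key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "headline" and isinstance(value, str) and value:
                found.append(value)
            else:
                found.extend(collect_headlines(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_headlines(item))
    return found


# Made-up script text standing in for what a real page might embed.
sample_script = (
    'window.__preloadedData = {"initialState": {'
    '"a": {"headline": "First story", "__typename": "PromotionalProperties"}, '
    '"b": {"items": [{"headline": "Second story"}]}, '
    '"c": {"headline": ""}}};'
)
state = parse_preloaded(sample_script)
for number, headline in enumerate(collect_headlines(state), start=1):
    print("Headline %s: %s" % (number, headline))
```

Because the walk keys on the name headline rather than on a CSS class, it may carry over to other sites that embed their data as JSON, though each site's key names would need to be checked by hand first.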

Getting headlines from news site homepages with BeautifulSoup in Python
