Question

import requests
from bs4 import BeautifulSoup

url = 'https://www.brightscope.com/401k-rating/240370/Abengoa-Bioenergy-Company-Llc/244317/Abengoa-Bioenergy-Us-401K-Savings-Plan/'
thepage = requests.get(url)
urlsoup = BeautifulSoup(thepage.text, "html.parser")

plandata = urlsoup.find(class_="plans-section").text

print(plandata)

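I'm trying to scrape only the rating-number class, but when I use this code I get nothing back :(.

How do I scrape only the rating number?
How do I scrape multiple classes (this is the most important part) and put them into a readable list?

My idea is to loop over each page and append the results to a .csv file, one new line per page.

Example below: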

Rating #1, Company Name1, etc, etc, etc

Rating #2, Company Name2, etc, etc, etc

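I can't get past this problem. Thanks for your help!

Edit - The class "plans-section" holds the data I want, but it seems to be split across two div tags. I want to scrape the data in the class "data-text above-average". The problem is that the only part every page seems to share is "data-text"; whatever comes after it changes from section to section and page to page. What are my options?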

Answer 1

What exactly do you want to get from this page? If you want to go class by class, this should help.

urlsoup.find_all("div", {"class": "rating-number"})
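Note that this call returns a list of tags rather than text, so you still need to pull the text out of each match. A minimal sketch, assuming markup roughly like the snippet below (the HTML string is a hypothetical stand-in, the real page may be structured differently):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the actual HTML may differ.
html = """
<div class="plans-section">
  <div class="rating-number">59</div>
  <div class="name">Example Company, Inc.</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of Tag objects, so extract the text of each one.
ratings = [
    tag.get_text(strip=True)
    for tag in soup.find_all("div", class_="rating-number")
]
print(ratings)
```

If the list comes back empty on the live page, it is worth checking whether the class name actually appears in thepage.text.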

Answer 2

import requests
from bs4 import BeautifulSoup


url = 'https://www.brightscope.com/401k-rating/141759/Aj-Kirkwood-Associates-Inc/143902/Aj-Kirkwood-Associates-Inc-401K-Profit-Sharing-Plan/'
thepage = requests.get(url)
urlsoup = BeautifulSoup(thepage.text, "html.parser")

rate = urlsoup.find(class_='rating-number').text
name = urlsoup.find(class_="name").text
print(rate, name)

Out:

59 A.J. Kirkwood & Associates, Inc.
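For the .csv part of the question, one way to go is to append a row per page with the standard csv module. This is only a sketch: the second row and the ratings.csv file name are made up for illustration, and in practice each pair would come from the rate and name values scraped above.

```python
import csv
import os
import tempfile

# (rating, company name) pairs. The first is the output shown above;
# the second is a hypothetical value for illustration only.
scraped = [
    ("59", "A.J. Kirkwood & Associates, Inc."),
    ("72", "Example Company, LLC"),
]

# Hypothetical output location; a temp directory keeps the sketch self-contained.
csv_path = os.path.join(tempfile.mkdtemp(), "ratings.csv")

# Mode "a" appends, so each scraped page adds one new line to the file.
for rating, name in scraped:
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([rating, name])

# Read the file back to check what was written.
with open(csv_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```

The csv writer quotes fields that contain commas, so a company name like "A.J. Kirkwood & Associates, Inc." stays in a single column.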

Use a regular expression filter to match all classes that contain specific text:

If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method.

In your case:

import re
urlsoup.find_all(class_=re.compile(r'data-text.+'))
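Here is a small self-contained sketch of the same idea, using a slightly simpler pattern that matches the shared "data-text" class on its own. The HTML string is a hypothetical stand-in for the real page, and the "below-average" and "label" classes are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; only "data-text above-average" is named in the question.
html = """
<div class="plans-section">
  <div class="data-text above-average">82</div>
  <div class="data-text below-average">41</div>
  <div class="label">Fees</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each class on a tag is tested against the pattern, so any tag that has
# the "data-text" class matches, whatever its second class is.
matches = soup.find_all(class_=re.compile(r"^data-text"))
values = [tag.get_text(strip=True) for tag in matches]
print(values)
```

Since "data-text" is a class of its own here, soup.find_all(class_="data-text") should return the same tags without a regular expression.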

Python and BeautifulSoup - Scrape Text
