Beautiful Soup: parsing multiple tags

Time: 2019-09-08 21:54:39

Tags: python beautifulsoup screen-scraping

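I'm working with data from my school's grading system and trying to figure out how to extract it by category.

Here is the raw HTML: https://pastebin.com/icbaemd7

So far I have written this Python script: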

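from bs4 import BeautifulSoup

html = driver.page_source  # driver: presumably an already-open Selenium WebDriver
soup = BeautifulSoup(html, 'html.parser')

chemData = soup.find_all('td')
content = []
print(chemData)
print("")
for i in chemData:
    # get_text() already returns the cell text without tags,
    # so there is no need to split on '</td'
    content.append(i.get_text())
for k in content:
    print(k)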

Which returns this result:

Safety Contract Signed
1/1
8/13/2019

Student Profile Sheet Turned In
1/1
8/13/2019

Polyatomic Ion Quiz
10/10
8/19/2019

HW Quiz Ch. 3 Target 6
3/3
8/27/2019

HW Quiz (Ch. 3 Targets 1-6)
12/16
8/28/2019

Chapters 1 & 2 Formative Quiz
15/17
8/21/2019

Chapter 3 Formative Quiz
23.5/25
9/5/2019

Lab Report: Antifreeze Lab
10/10
8/21/2019

Types of Reactions Lab Report
11/12
8/23/2019

Hydrate Lab Report
10/10
8/29/2019

Lab Assessment - Types of Reactions Lab
10/15
8/26/2019

Lab Assessment: Hydrate Lab
10/10
9/3/2019

But I want to sort these into the categories shown in the HTML. If I run the same script with h3 instead of td, I get the category names:

Homework
Formative Quizzes
Lab Reports
Lab Assessments

So my question is: how can I automatically sort each assignment into its category?

Any help would be greatly appreciated. Thanks!

2 answers:

Answer 0 (score: 0)

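Your HTML doesn't render properly. As a quick solution, though, find the parent container that holds both the h3 tag and the table for each category, and scrape that parent container first. For example, suppose the h3 tag and the table sit under a div. Then scrape the div tags first, i.e. d = soup.find_all('div'). Then loop over d, extracting the h3 tag and then the tr / td elements from each one, e.g. d[0].find_all('h3'), d[0].find_all('td'), and so on.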
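A minimal sketch of that idea, in case it helps. The markup below is made up for illustration: the div wrapper around each h3 and table is exactly the assumption described above and may not match the real page, and `group_by_container` is just a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented sample markup. The <div> around each <h3> + <table> pair is an
# assumption; adjust the container tag to whatever the real page uses.
sample_html = """
<div>
  <h3>Formative Quizzes</h3>
  <table>
    <tr><td>Chapter 3 Formative Quiz</td><td>23.5/25</td><td>9/5/2019</td></tr>
  </table>
</div>
<div>
  <h3>Lab Reports</h3>
  <table>
    <tr><td>Hydrate Lab Report</td><td>10/10</td><td>8/29/2019</td></tr>
  </table>
</div>
"""


def group_by_container(html, container="div"):
    """Map each h3 heading to the table rows found in the same container."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = {}
    for block in soup.find_all(container):
        heading = block.find("h3")
        if heading is None:
            continue  # this container does not hold a category
        rows = []
        for tr in block.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # skip rows without <td> cells, e.g. header rows
                rows.append(cells)
        grouped[heading.get_text(strip=True)] = rows
    return grouped


grouped = group_by_container(sample_html)
for category, rows in grouped.items():
    print(category)
    for row in rows:
        print("   ", row)
```

If the containers are nested inside a larger div, the outer one will match too and pick up the first h3 with every row, so it is worth narrowing the search with a class or id when the real page provides one.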

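Answer 1 (score: 0)

Try something like the following: test whether the element is an h3 and, if so, set it as the dictionary key; otherwise append the row's values under the current dict[key].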

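from bs4 import BeautifulSoup as bs

html = '''yourHTML'''  # paste the page source here
soup = bs(html, 'lxml')  # needs the lxml package; 'html.parser' also works
results = {}
header = None  # most recent h3 seen so far

# select() returns matches in document order, so each tr follows its h3
for i in soup.select('h3, tr'):
    if i.name == 'h3':
        header = i.text
        results[header] = []
    elif header is not None:  # ignore rows that appear before the first h3
        results[header].append(' '.join([n.text for n in i.select('td')]))
print(results)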
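
A variation on the same idea, offered only as a sketch: instead of tracking the current heading while looping, each row can look backwards for the nearest preceding h3 with find_previous. The sample markup, the `group_rows_by_heading` helper, and the "Uncategorized" fallback key are all invented for illustration and are not taken from the original page.

```python
from bs4 import BeautifulSoup

# Invented sample: headings and tables as siblings, with one header row.
sample_html = """
<h3>Formative Quizzes</h3>
<table>
  <tr><th>Assignment</th><th>Score</th><th>Date</th></tr>
  <tr><td>Chapter 3 Formative Quiz</td><td>23.5/25</td><td>9/5/2019</td></tr>
</table>
<h3>Lab Reports</h3>
<table>
  <tr><td>Hydrate Lab Report</td><td>10/10</td><td>8/29/2019</td></tr>
</table>
"""


def group_rows_by_heading(html):
    """Group table rows under the closest h3 that precedes them."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = {}
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if not cells:
            continue  # rows with only <th> cells carry no assignment data
        heading = tr.find_previous("h3")  # nearest earlier h3 in the document
        # "Uncategorized" is a made-up fallback for rows before any h3
        key = heading.get_text(strip=True) if heading else "Uncategorized"
        grouped.setdefault(key, []).append(cells)
    return grouped


by_heading = group_rows_by_heading(sample_html)
for category, rows in by_heading.items():
    print(category, rows)
```

This keeps each row as a list of cells rather than one joined string, which makes it easier to pull out the score or date later.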