Question

I'm trying to remove br tags from HTML using BeautifulSoup.

Example HTML:

<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br></span>

My Python code:

for link2 in soup.find_all('br'):
    link2.extract()
for link2 in soup.findAll('span',{'class':'qualification'}):
    print(link2.string)

The problem is that the code above only gets the first qualification.

Answer 1

Because none of these <br> tags has a closing counterpart, Beautiful Soup adds the closing tags automatically, which produces the following HTML:

In [23]: soup = BeautifulSoup(html)

In [24]: soup.br
Out[24]: 
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br/></br></br></br>

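When you call Tag.extract on the first <br> tag, you remove it together with all of its descendants, including the strings those descendants contain: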

In [27]: soup
Out[27]: 
<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
</span>

It looks like you only need to extract all the text from the span element. If that is the case, don't remove anything:

In [28]: soup.span.text
Out[28]: '\nDoctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas\n\nMaster of Science (Computer Science), Government College University Lahore\n\nMaster of Science ( Computer Science ), University of Agriculture Faisalabad\n\nBachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad\n'

The Tag.text attribute extracts all the strings inside the given tag and joins them into one string.
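
If one entry per qualification is wanted rather than a single block of text, `get_text()` accepts `separator` and `strip` arguments. The following is a minimal sketch, assuming the `html.parser` backend and a shortened copy of the question's markup; the variable name `qualifications` is only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (two qualifications instead of four).
html = (
    '<span class="qualification">\n'
    "Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas\n"
    "<br>\n"
    "Master of Science (Computer Science), Government College University Lahore\n"
    "<br></span>"
)

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="qualification")

# strip=True trims each string and drops the whitespace-only ones;
# the separator is placed between the strings that remain.
qualifications = span.get_text(separator="\n", strip=True).split("\n")

for q in qualifications:
    print(q)
```

Splitting on the separator works here because none of the qualification strings contains a newline of its own once stripped.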

Answer 2

Using unwrap should work:

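from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html holds the markup from the question
for match in soup.find_all('br'):
    match.unwrap()  # drop the <br> tag itself but keep whatever it contains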
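
A small self-contained check of what unwrap leaves behind may help. This is a sketch on made-up sample markup, assuming the `html.parser` backend. One thing to keep in mind: after unwrapping, the span still holds several separate text nodes, so `.string` (which only returns a value when a tag has exactly one child) is None, and `get_text()` or `.strings` is the way to read the content.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, much shorter than the question's HTML.
html = '<span class="qualification">First degree<br>Second degree<br></span>'
soup = BeautifulSoup(html, "html.parser")

for br in soup.find_all("br"):
    br.unwrap()  # removes the tag itself, keeps anything it wrapped

span = soup.find("span", class_="qualification")
remaining_br = span.find("br")  # None once every <br> is gone
pieces = list(span.strings)     # the text nodes are still separate
print(pieces)
print(span.string)              # None: the span has more than one child
```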

Answer 3

Here is one way to do it:

for link2 in soup.findAll('span',{'class':'qualification'}):
    for s in link2.stripped_strings:
        print(s)

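Unless you need the <br> tags removed for later processing, there is no need to remove them. Here link2.stripped_strings is a generator that yields each string inside the tag with leading and trailing whitespace stripped. The print loop can be written more concisely as: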

for link2 in soup.findAll('span',{'class':'qualification'}):
    print(*link2.stripped_strings, sep='\n')
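
To keep the qualifications for later use instead of printing them, the generator can be collected into a list. The sketch below assumes the `html.parser` backend; `qualifications_by_span` is a hypothetical helper name and the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup


def qualifications_by_span(html):
    """Hypothetical helper: one list of stripped strings per matching span."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        list(span.stripped_strings)
        for span in soup.find_all("span", class_="qualification")
    ]


# Made-up sample with two spans, mimicking the structure in the question.
sample = (
    '<span class="qualification">\nPhD, University A\n<br>\nMSc, University B\n<br></span>'
    '<span class="qualification">\nBSc, University C\n<br></span>'
)

result = qualifications_by_span(sample)
print(result)
```

Each inner list corresponds to one span, so people with several qualifications stay grouped together.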

Python BeautifulSoup: removing self-closing tags
