Python BeautifulSoup: removing self-closing tags

Time: 2016-07-27 10:10:38

Tags: python beautifulsoup

I am trying to remove the br tags from some HTML code using BeautifulSoup.

Example HTML:

<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br></span>

My Python code:

for link2 in soup.find_all('br'):
    link2.extract()
for link2 in soup.find_all('span', {'class': 'qualification'}):
    print(link2.string)

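The problem is that the code above only returns the first qualification.

3 answers:

Answer 0: (score: 1)

Because none of these <br> tags has a closing counterpart, Beautiful Soup adds the closing tags automatically, which produces the following HTML: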

In [23]: soup = BeautifulSoup(html)

In [24]: soup.br
Out[24]: 
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br/></br></br></br>

When you call Tag.extract on the first <br> tag, you also remove all of its descendants and the strings they contain, which leaves:

In [27]: soup
Out[27]: 
<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
</span>

It looks like you only need to extract all the text from the span element. If that is the case, there is no need to remove anything:

In [28]: soup.span.text
Out[28]: '\nDoctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas\n\nMaster of Science (Computer Science), Government College University Lahore\n\nMaster of Science ( Computer Science ), University of Agriculture Faisalabad\n\nBachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad\n'

The Tag.text property extracts all the strings from the given tag.

Answer 1: (score: 0)

Using unwrap should work:
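
As a rough sketch of the difference, the snippet below contrasts Tag.string with Tag.get_text() on a shortened copy of the HTML from the question. The variable names and the explicit "html.parser" argument are choices made for this illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (two qualifications instead of four).
html = """<span class="qualification">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br></span>"""

# Naming the parser explicitly is an assumption of this sketch.
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="qualification")

# .string is None whenever a tag has more than one child, as this span does.
single = span.string

# get_text() walks every descendant string; strip=True trims each piece and
# drops whitespace-only pieces before joining them with the separator.
text = span.get_text(separator="\n", strip=True)
print(text)
```

Because get_text() collects strings from all descendants, it should give the same lines whether or not the parser nests the <br> tags.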

soup = BeautifulSoup(html)
for match in soup.findAll('br'):
    match.unwrap()
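
For a self-contained way to try this, here is a minimal sketch on a shortened copy of the question's HTML. The sample string, the variable names and the "html.parser" argument are assumptions made for the example. Note that after unwrapping, the span may still hold several separate text nodes, so the sketch reads the result with get_text() rather than .string.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (two qualifications instead of four).
html = """<span class="qualification">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br></span>"""

soup = BeautifulSoup(html, "html.parser")

# unwrap() replaces each <br> with its own contents, so no text is lost
# even if the parser nested later content inside a <br>.
for br in soup.find_all("br"):
    br.unwrap()

span = soup.find("span", class_="qualification")
remaining = span.find_all("br")  # expected to be empty after unwrapping

# Join whatever text nodes are left inside the span.
text = span.get_text()
print(text)
```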

Answer 2: (score: 0)

Here is one way to do it:

for link2 in soup.findAll('span',{'class':'qualification'}):
    for s in link2.stripped_strings:
        print(s)

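Unless you need the <br> tags removed for later processing, there is no need to remove them. Here link2.stripped_strings is a generator that yields each string inside the tag with leading and trailing whitespace stripped, skipping strings that are entirely whitespace. The print loop can be written more concisely as: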

for link2 in soup.findAll('span',{'class':'qualification'}):
    print(*link2.stripped_strings, sep='\n')
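
If the qualifications are needed as data rather than printed output, the same generator can fill a list. The following is a sketch under assumptions: the shortened sample HTML, the name qualifications and the "html.parser" argument are all illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (two qualifications instead of four).
html = """<span class="qualification">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br></span>"""

soup = BeautifulSoup(html, "html.parser")

# Collect every non-empty, whitespace-stripped string from each matching span.
qualifications = []
for span in soup.find_all("span", class_="qualification"):
    qualifications.extend(span.stripped_strings)

# Print the collected items with a running number.
for number, item in enumerate(qualifications, start=1):
    print(number, item)
```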