I am trying to remove the br tags from some HTML using BeautifulSoup.
The HTML looks like this:
<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br></span>
My Python code:
for link2 in soup.find_all('br'):
    link2.extract()
for link2 in soup.findAll('span',{'class':'qualification'}):
    print(link2.string)
The problem is that the code above only retrieves the first qualification.
Answer 0 (score: 1)
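Because none of these <br> tags has a closing counterpart, Beautiful Soup (with the parser used in this session) adds the closing tags itself, which nests everything after each <br> inside it and produces the following HTML: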
In [23]: soup = BeautifulSoup(html)
In [24]: soup.br
Out[24]:
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br/></br></br></br>
When you call Tag.extract on the first <br> tag, you remove all of its descendants and the strings those descendants contain, which leaves:
In [27]: soup
Out[27]:
<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
</span>
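To make the effect of extract concrete, here is a minimal sketch on a made-up fragment (the <p>, <b>, and <i> tags are illustrative assumptions, not part of the question's HTML). It only shows that extract takes the whole subtree with it, which is what happens to the text nested under the first <br> above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <b> contains a string and a nested <i> tag.
soup = BeautifulSoup("<p>one<b>two<i>three</i></b></p>", "html.parser")

# extract() detaches the tag from the tree and returns it,
# together with everything nested inside it.
removed = soup.b.extract()

removed_text = removed.get_text()   # text that left the tree with <b>
remaining_text = soup.p.get_text()  # text still in the document
print(repr(removed_text), repr(remaining_text))
```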
It looks like you just need all the text from the span element. If that is the case, don't remove anything:
In [28]: soup.span.text
Out[28]: '\nDoctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas\n\nMaster of Science (Computer Science), Government College University Lahore\n\nMaster of Science ( Computer Science ), University of Agriculture Faisalabad\n\nBachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad\n'
The Tag.text attribute extracts all the strings from the given tag.
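As a small sketch of the difference between .string and the text of a tag, the snippet below uses a shortened version of the question's span and assumes a recent bs4 with the built-in "html.parser". The variable names and the choice of get_text with separator and strip arguments are illustrative, not something the original answer used.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the span from the question.
html = """<span class="qualification">
Doctor of Philosophy
<br>
Master of Science
<br></span>"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="qualification")

# .string is None whenever a tag has more than one child node.
single = span.string

# get_text() gathers every string under the tag; strip=True trims each
# piece and drops empty ones, separator is placed between the pieces.
joined = span.get_text(separator="\n", strip=True)
print(joined)
```

Since .string only works for a tag with a single child, get_text (or .text) is usually the safer choice for a span that mixes text and <br> tags.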
Answer 1 (score: 0)
Using unwrap should work:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
# unwrap() removes each <br> tag but keeps whatever it contains
for match in soup.find_all('br'):
    match.unwrap()
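A point worth checking after unwrapping: the span is left with several separate text nodes, so .string on it is still None and the text has to be read with .strings or get_text. The following is a rough sketch on a shortened, made-up span, assuming a recent bs4 and the "html.parser" parser.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened input with the same shape as the question's HTML.
html = '<span class="qualification">PhD<br>MSc<br>BSc<br></span>'
soup = BeautifulSoup(html, "html.parser")

# Unlike extract(), unwrap() keeps the tag's children in place.
for br in soup.find_all("br"):
    br.unwrap()

span = soup.find("span", class_="qualification")
remaining_br = len(span.find_all("br"))  # no <br> tags are left
pieces = list(span.strings)              # text nodes stay separate
as_string = span.string                  # None: more than one child
print(remaining_br, pieces, as_string)
```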
Answer 2 (score: 0)
Here is one way to do it:
for link2 in soup.findAll('span',{'class':'qualification'}):
    for s in link2.stripped_strings:
        print(s)
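There is no need to remove the <br> tags unless you need them gone for later processing. Here link2.stripped_strings is a generator that yields each string in the tag with leading and trailing whitespace stripped. The print loop can be written more concisely as: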
for link2 in soup.findAll('span',{'class':'qualification'}):
    print(*link2.stripped_strings, sep='\n')
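If the qualifications are needed as data and not just printed, the same generator can be collected into a list. This is only a sketch: the HTML is a shortened copy of the question's span, the name qualifications is made up for illustration, and a recent bs4 with "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Two of the four qualifications from the question, for brevity.
html = """<span class="qualification" style="font-size:14px;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br></span>"""

soup = BeautifulSoup(html, "html.parser")

# One list of cleaned strings per matching span.
qualifications = [
    list(span.stripped_strings)
    for span in soup.find_all("span", class_="qualification")
]
print(qualifications)
```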