How do I remove a table from HTML using Python?
I have a situation like this:
paragraph = '''
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br /><br />
<table>
<tr>
<td>
text title
</td>
<td>
text title 2
</td>
</tr>
</table>
<p> lorem ipsum</p>
'''
How can I remove the table structure above using Python? I would like the output to look like this:
paragraph = '''
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br /><br />
<p> lorem ipsum</p>
'''
Answer 0: (score: 6)
You can use BeautifulSoup, in particular PageElement.extract(), which removes an element from the tree and returns it:
In [16]: from bs4 import BeautifulSoup
In [17]: soup = BeautifulSoup("""<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br /><br />
....: <table>
....: <tr>
....: <td>
....: text title or some
....: </td>
....: </tr>
....: </table>
....: <p> lorem ipsum</p>""")
In [18]: _ = soup.table.extract()
In [19]: soup
Out[19]:
<html><body><p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br/><br/>
</p>
<p> lorem ipsum</p></body></html>
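If the HTML may contain more than one table, a minimal sketch along the same lines could loop until no table is left. It assumes the bs4 package is installed and uses the built-in "html.parser" backend, so no <html><body> wrapper is added around a fragment. strip_tables is a hypothetical helper name, and decompose() is used instead of extract() only because the removed element is not needed afterwards.

```python
from bs4 import BeautifulSoup


def strip_tables(html: str) -> str:
    """Return html with every <table> element (and its contents) removed."""
    # "html.parser" ships with Python; other parsers may wrap or repair the markup differently.
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    while table is not None:
        # decompose() removes the element from the tree and discards it.
        table.decompose()
        # Search again from the top so nested or later tables are handled one at a time.
        table = soup.find("table")
    return str(soup)


if __name__ == "__main__":
    sample = (
        "<p>intro</p>\n"
        "<table><tr><td>a</td></tr></table>\n"
        "<p>mid</p>"
        '<table id="x"><tr><td>b</td></tr></table>'
        "<p>end</p>"
    )
    print(strip_tables(sample))
```

The whitespace that surrounded each table stays in the output, so some blank lines may remain.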
Answer 1: (score: 0)
You can also try this basic string slicing:
start = paragraph.find('<table>')   # index of the first character of '<table>'
end = paragraph.find('</table>')    # index of the first character of '</table>'
# Keep the text before '<table>' and the text after '</table>'
paragraph = paragraph[:start] + paragraph[end + len('</table>'):]
print(paragraph)
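Plain string operations are enough here, as long as the input contains exactly one <table>...</table> pair and the opening tag has no attributes.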
Answer 2: (score: 0)
Using regular expressions gets complicated, so here is a naive approach I came up with:
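def remove_table(s):
    left_index = s.find('<table>')
    if left_index == -1:
        return s  # no (more) tables
    right_index = s.find('</table>', left_index)
    if right_index == -1:
        return s  # unclosed table: leave the rest unchanged
    # Drop this table, then recurse on the remainder to remove any later tables
    return s[:left_index] + remove_table(s[right_index + len('</table>'):])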
The result may contain some blank lines where the table used to be.
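For comparison, a regular-expression version is possible if you accept its limits. The sketch below assumes tables are not nested (the non-greedy match stops at the first </table>) and that the markup is reasonably regular; remove_tables_regex is a hypothetical name, not something from the answers above. It also collapses the blank lines left behind.

```python
import re

# Matches an opening <table ...> tag (attributes allowed) through the nearest
# </table>, case-insensitively and across newlines.
TABLE_RE = re.compile(r"<table\b[^>]*>.*?</table\s*>", re.IGNORECASE | re.DOTALL)


def remove_tables_regex(html: str) -> str:
    """Remove non-nested <table> blocks and collapse the blank lines they leave."""
    without_tables = TABLE_RE.sub("", html)
    # Replace each run of blank (or whitespace-only) lines with a single newline.
    return re.sub(r"\n\s*\n", "\n", without_tables)


if __name__ == "__main__":
    sample = "<p>a</p>\n<table class='x'>\n<tr><td>t</td></tr>\n</table>\n<p>b</p>"
    print(remove_tables_regex(sample))
```

With nested tables the match would end at the inner closing tag and leave a stray </table> behind, which is one reason an HTML parser is usually the safer choice.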