I'm trying to extract the parts of the HTML that sit above and below the <table> tag. For example, given the sample HTML below:
sample_html = """
<html>
<title><b>Main Title</b></Title>
<b>more</b>
<b>stuff</b>
<b>in here!</b>
<table class="softwares" border="1" cellpadding="0" width="99%">
<thead style="background-color: #ededed">
<tr>
<td colspan="5"><b>Windows</b></td>
</tr>
</thead>
<tbody>
<tr>
<td><b>Type</b></td>
<td><b>Issue</b></td>
<td><b>Restart</b></td>
<td><b>Severity</b></td>
<td><b>Impact</b></td>
</tr>
<tr>
<td>some item</td>
<td><a href="some website">some website</a><br></td>
<td>Yes<br></td>
<td>Critical<br></td>
<td>stuff<br></td>
</tr>
<tr>
<td>some item</td>
<td><a href="some website">some website</a><br></td>
<td>Yes<br></td>
<td>Important<br></td>
<td>stuff<br></td>
</tr>
</tbody>
</table>
<b>AGAIN</b>
<b>more</b>
<b>stuff</b>
<b>down here!</b>
</html>
"""
I'd like to get something like this:
top_html = """
<html>
<title><b>Main Title</b></Title>
<b>more</b>
<b>stuff</b>
<b>in here!</b>
</html>
"""
bottom_html = """
<html>
<b>AGAIN</b>
<b>more</b>
<b>stuff</b>
<b>down here!</b>
</html>
"""
Or already as plain text, for example:
top_html = 'Main Title more stuff in here!'
bottom_html = 'AGAIN more stuff down here!'
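So far I've been able to extract the <table> part from the full HTML and process it (I split it into rows, <tr>, and columns, <td>, to pull out the values I need) with the following code: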
from bs4 import BeautifulSoup

soup = BeautifulSoup(input_html, "html.parser")
table = soup.find('table')
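Answer 0 (score: 1)
This solution doesn't make much use of BeautifulSoup, but it works. Find the indices of the opening and closing table tags, then slice out the strings before and after them.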
from bs4 import BeautifulSoup

soup = BeautifulSoup(sample_html, "html.parser")

def extract_top_and_bottom(html):
    # Locate the opening and closing table tags in the serialized HTML
    index_of_opening_tag = html.index("<table")
    index_of_closing_tag = html.index("</table>")
    # Keep everything before the table and everything after its closing tag
    top_html = html[:index_of_opening_tag]
    bottom_html = html[index_of_closing_tag + len("</table>"):]
    print(top_html)
    print(bottom_html)

extract_top_and_bottom(str(soup))
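The same string-based idea can also be written with `str.partition`, so the function returns the two pieces instead of printing them. This is only a sketch: the name `split_around_table` and the `example` string are made up for illustration, and it assumes the document contains exactly one table written with lowercase tags.

```python
def split_around_table(html):
    """Return (top, bottom): the markup before <table ...> and after </table>.

    Assumption: a single table with lowercase tags; otherwise ValueError.
    """
    top, opening, rest = html.partition("<table")
    if not opening:
        raise ValueError("no <table> tag found")
    _table_body, closing, bottom = rest.partition("</table>")
    if not closing:
        raise ValueError("no closing </table> tag found")
    return top, bottom


# Hypothetical miniature document, used only for illustration.
example = "<html><b>above</b><table><tr><td>x</td></tr></table><b>below</b></html>"
top_html, bottom_html = split_around_table(example)
print(top_html)
print(bottom_html)
```

Returning the pieces makes it easy to pass each one back to BeautifulSoup afterwards if plain text is needed.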
Answer 1 (score: 1)
Split the HTML on the table's own HTML, then extract the text from each piece.
from bs4 import BeautifulSoup as bs
sample_html = """
<html>
<title><b>Main Title</b></Title>
<b>more</b>
<b>stuff</b>
<b>in here!</b>
<table class="softwares" border="1" cellpadding="0" width="99%">
<thead style="background-color: #ededed">
<tr>
<td colspan="5"><b>Windows</b></td>
</tr>
</thead>
<tbody>
<tr>
<td><b>Type</b></td>
<td><b>Issue</b></td>
<td><b>Restart</b></td>
<td><b>Severity</b></td>
<td><b>Impact</b></td>
</tr>
<tr>
<td>some item</td>
<td><a href="some website">some website</a><br></td>
<td>Yes<br></td>
<td>Critical<br></td>
<td>stuff<br></td>
</tr>
<tr>
<td>some item</td>
<td><a href="some website">some website</a><br></td>
<td>Yes<br></td>
<td>Important<br></td>
<td>stuff<br></td>
</tr>
</tbody>
</table>
<b>AGAIN</b>
<b>more</b>
<b>stuff</b>
<b>down here!</b>
</html>
"""
soup = bs(sample_html, 'lxml')
results = str(soup).split(str(soup.select_one('table.softwares')))
top_text = bs(results[0], 'lxml').get_text().replace('\n',' ')
bottom_text = bs(results[1], 'lxml').get_text().replace('\n',' ')
print(top_text)
print(bottom_text)
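If you'd rather stay inside the parse tree than split strings, a possible alternative is to collect the table's sibling tags with `find_previous_siblings` and `find_next_siblings`. This is a rough sketch rather than a drop-in replacement: `example_html` is a made-up miniature document, and it assumes the content you want sits at the same nesting level as the table, as it does in the question's sample.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document, used only for illustration.
example_html = """<html>
<b>above</b>
<b>the table</b>
<table><tr><td>cell</td></tr></table>
<b>below</b>
<b>the table</b>
</html>"""

soup = BeautifulSoup(example_html, "html.parser")
table = soup.find("table")

# find_previous_siblings walks backwards from the table, so reverse the
# result to restore document order. Passing True matches every tag.
above = list(reversed(table.find_previous_siblings(True)))
below = table.find_next_siblings(True)

# Join the text of each sibling tag with single spaces.
top_text = " ".join(tag.get_text(strip=True) for tag in above)
bottom_text = " ".join(tag.get_text(strip=True) for tag in below)

print(top_text)
print(bottom_text)
```

Because this works on tags rather than on serialized markup, it doesn't depend on how the parser writes the table back out as a string.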