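I have a table with several columns. The last column may contain links to documents, but the number of links per cell is not fixed (anywhere from zero upward).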
<tbody>
<tr>
<td>
<h2>Table Section</h2>
</td>
</tr>
<tr>
<td>
<a href="#">Object 1</a>
</td>
<td>Param 1</td>
<td>
<span class="text-nowrap">Param 2</span>
</td>
<td class="text-nowrap"></td>
</tr>
<tr>
<td>
<a href="#">Object 2</a>
</td>
<td>Param 1</td>
<td>
<span class="text-nowrap">Param 2</span>
<td>
<ul>
<li>
<small>
<a href="link_to.doc">Title</a>Notes
</small>
</li>
<li>
<small>
<a href="another_link_to.doc">Title2</a>Notes2
</small>
</li>
</ul>
</td>
</tr>
</tbody>
So basic parsing is not the problem. What I am stuck on is getting the links together with their titles and notes and appending them to a Python list (or NumPy array).
from bs4 import BeautifulSoup

with open("new 1.html", encoding="utf8") as dump:
    soup = BeautifulSoup(dump, features="lxml")

data = []
table_body = soup.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    # One list of stripped cell texts per table row
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols)
    # Real document links (anything other than "#") end up as separate entries
    a = row.find_all('a')
    for ele1 in a:
        if ele1.get('href') != "#":
            data.append([ele1.get('href')])
print(*data, sep='\n')
Output:
['Table Section']
['Object 1', 'Param 1', 'Param 2', '']
['Object 2', 'Param 1', 'Param 2', 'TitleNotes\n\t\t\t \n\n\n\nTitle2Notes2']
['link_to.doc']
['another_link_to.doc']
Is it possible to append the links to the first list? I would like the list for the second row to look like this:
['Object 2', 'Param 1', 'Param 2', 'Title', 'Notes', 'link_to.doc', ' Title2', 'Notes2', 'another_link_to.doc']
Answer 0 (score: 0)
Something like this:
from bs4 import BeautifulSoup
html = '''<tbody>
<tr>
<td>
<h2>Table Section</h2>
</td>
</tr>
<tr>
<td>
<a href="#">Object 1</a>
</td>
<td>Param 1</td>
<td>
<span class="text-nowrap">Param 2</span>
</td>
<td class="text-nowrap"></td>
</tr>
<tr>
<td>
<a href="#">Object 2</a>
</td>
<td>Param 1</td>
<td>
<span class="text-nowrap">Param 2</span>
<td>
<ul>
<li>
<small>
<a href="link_to.doc">Title</a>Notes
</small>
</li>
<li>
<small>
<a href="another_link_to.doc">Title2</a>Notes2
</small>
</li>
</ul>
</td>
</tr>
</tbody>'''
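soup = BeautifulSoup(html, features="lxml")
# In this markup each <small> contains a whitespace string, then the <a> tag,
# then the note text, so contents[1] is the link element.
smalls = soup.find_all('small')
links = [s.contents[1].attrs['href'] for s in smalls]
print(links)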
Output:
['link_to.doc', 'another_link_to.doc']
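The snippet above only collects the hrefs. To get each row as one flat list in the shape the question asks for, a minimal sketch could look like the following. It is not taken from the original answer: the helper name `row_to_list` is hypothetical, the HTML is a shortened copy of the question's table, and it assumes the note is the plain text that follows each `<a>` inside the same parent. It uses the built-in `html.parser` instead of `lxml` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's table (assumed to have the same structure).
html = """
<table><tbody>
<tr><td><a href="#">Object 1</a></td><td>Param 1</td>
<td><span class="text-nowrap">Param 2</span></td><td class="text-nowrap"></td></tr>
<tr><td><a href="#">Object 2</a></td><td>Param 1</td>
<td><span class="text-nowrap">Param 2</span></td>
<td><ul>
<li><small><a href="link_to.doc">Title</a>Notes</small></li>
<li><small><a href="another_link_to.doc">Title2</a>Notes2</small></li>
</ul></td></tr>
</tbody></table>
"""


def row_to_list(row):
    """Hypothetical helper: flatten one <tr> into a single list of strings."""
    cells = row.find_all("td")
    if not cells:
        return []
    *plain, last = cells
    values = [td.get_text(strip=True) for td in plain]

    # Only real document links count; "#" placeholders are ignored.
    links = [a for a in last.find_all("a", href=True) if a["href"] != "#"]
    if not links:
        # No document links: keep the last cell as ordinary text.
        values.append(last.get_text(strip=True))
        return values

    for a in links:
        title = a.get_text(strip=True)
        # Assumption: the note is the bare text following the <a> tag.
        notes = "".join(s for s in a.next_siblings if isinstance(s, str)).strip()
        values.extend([title, notes, a["href"]])
    return values


soup = BeautifulSoup(html, "html.parser")
data = [row_to_list(row) for row in soup.find_all("tr")]
print(*data, sep="\n")
```

With the sample HTML above, the second row should come out as `['Object 2', 'Param 1', 'Param 2', 'Title', 'Notes', 'link_to.doc', 'Title2', 'Notes2', 'another_link_to.doc']`, and a row with an empty last cell keeps the `''` placeholder, as in the question's own output.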