HTML code:
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr><th>Name</th><th>Email</th><th>Supervisor</th><th>Room</th><th>Phone</th></tr>
<tr>
<td>Anastasiou, Alexandros</td>
<td><a href="mailto:alexandros.anastasiou07">alexandros.anastasiou07</a></td>
<td>Prof Duff</td>
<td>512b</td>
<td>47838</td>
</tr>
<tr>
<td>Ashmore, Anthony</td>
<td><a href="mailto:a.ashmore12">a.ashmore12</a></td>
<td>Prof Waldram</td>
<td>512b</td>
<td>47838</td>
</tr>
<tr>
<td>Banks, Elliot</td>
<td><a href="mailto:EB713">EB713</a></td>
<td>Prof Gauntlett</td>
<td>512a</td>
<td>47839</td>
</tr>
</tbody>
</table>
That is the HTML. The third td tag in each tr contains more tags... please help me.
My Python code:
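# htmltext holds the HTML shown above
souphandler = BeautifulSoup(htmltext)
table = souphandler.find('table')
tr_tag = table.find('tr')
while tr_tag is not None:
    # iterating a tag yields its direct children (the cells and the text between them)
    for row in tr_tag:
        print(row.string)
    tr_tag = tr_tag.findNext('tr')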
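This code prints everything over and over. I want to extract all the data in the tr tags.
Answer 0 (score: 0)
You need to find the tr tags, then extract the th tags from the first one and the td tags from the others: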
h = """
<table border="0" cellpadding="0" cellspacing="0">
<tr><th>Name</th><th>Email</th><th>Supervisor</th><th>Room</th><th>Phone</th></tr>
<tr>
<td>Anastasiou, Alexandros</td>
<td><a href="mailto:alexandros.anastasiou07">alexandros.anastasiou07</a></td>
<td>Prof Duff</td>
<td>512b</td>
<td>47838</td>
</tr>
<tr>
<td>Ashmore, Anthony</td>
<td><a href="mailto:a.ashmore12">a.ashmore12</a></td>
<td>Prof Waldram</td>
<td>512b</td>
<td>47838</td>
</tr>
<tr>
<td>Banks, Elliot</td>
<td><a href="mailto:EB713">EB713</a></td>
<td>Prof Gauntlett</td>
<td>512a</td>
<td>47839</td>
</tr>
</table>"""
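from bs4 import BeautifulSoup

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(h, "html.parser")
table = soup.find("table")
# header row: the first tr holds the th cells
print(",".join([th.text for th in table.find("tr").find_all("th")]))
# "tr + tr" matches every tr that directly follows another tr, i.e. all rows but the first
for tr in table.select("tr + tr"):
    tds = tr.find_all("td")
    # the second cell wraps the address in an <a> tag; read its href attribute
    print(tds[1].a["href"])
    print(", ".join([td.text for td in tds]))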
Which will give you:
Name,Email,Supervisor,Room,Phone
mailto:alexandros.anastasiou07
Anastasiou, Alexandros, alexandros.anastasiou07, Prof Duff, 512b, 47838
mailto:a.ashmore12
Ashmore, Anthony, a.ashmore12, Prof Waldram, 512b, 47838
mailto:EB713
Banks, Elliot, EB713, Prof Gauntlett, 512a, 47839
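If you would rather keep the rows as data than print them, the same idea can be sketched as below. This is only one possible variant, not part of the answer above: the function name `table_to_records` and the variable `records` are made up for illustration, the table is a shortened copy of the one in the question, and it assumes every data row has the same number of cells as the header row. It uses `get_text()`, which returns the text of a cell even when the td contains nested tags such as `<a>`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (two data rows only).
html = """
<table>
<tr><th>Name</th><th>Email</th><th>Supervisor</th><th>Room</th><th>Phone</th></tr>
<tr>
<td>Anastasiou, Alexandros</td>
<td><a href="mailto:alexandros.anastasiou07">alexandros.anastasiou07</a></td>
<td>Prof Duff</td>
<td>512b</td>
<td>47838</td>
</tr>
<tr>
<td>Banks, Elliot</td>
<td><a href="mailto:EB713">EB713</a></td>
<td>Prof Gauntlett</td>
<td>512a</td>
<td>47839</td>
</tr>
</table>
"""


def table_to_records(markup):
    """Return one dict per data row, keyed by the header cell text.

    Hypothetical helper for illustration; assumes the first tr is the header.
    """
    soup = BeautifulSoup(markup, "html.parser")
    rows = soup.find("table").find_all("tr")
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for tr in rows[1:]:
        # get_text() flattens nested tags, so a td holding an <a> still
        # yields its visible text.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        records.append(dict(zip(headers, cells)))
    return records


records = table_to_records(html)
for record in records:
    print(record)
```

Each record can then be read by column name, for example `records[0]["Supervisor"]`, instead of by cell position.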