I'm trying to parse a simple HTML table with BeautifulSoup, but I'm running into some problems.
Here is my input:
<table id="people" class="tt" width="99%" border="0" cellpadding="0" cellspacing="1">
<tr>
<td colspan="3" bgcolor="#d3d3d3">
<p align="center" style="border: 1px solid #c0c0c0; padding: 0.02in">
<a name="faculty">
</a>
<b>
Faculty
</b>
</p>
</td>
</tr>
<tr>
<td>
<p align="center">
<font color="#000080">
<a href="http://www.website.com/%7Empop">
<font color="#000080">
<img src="images/mpop.jpg" name="graphics1" align="bottom" width="70" height="85" border="1" />
</font>
</a>
</font>
</p>
</td>
<td>
<p>
<b>
John Doe, Ph.D.
</b>
<br />
Associate Professor, Computer
Science
<br />
</p>
</td>
<td>
<p>
Office: Sciences Bldg.
<br />
Phone:
xxx-xxx-xxxx
<br />
jd [at] website.com
<br />
</p>
</td>
</tr>
<tr>
<td>
<p align="center">
<font color="#000080">
<a href="http://www.website.com/%7Ercolwell">
<font color="#000080">
<img src="images/rcolwell.jpg" name="graphics2" align="bottom" width="70" height="97" border="1" />
</font>
</a>
</font>
</p>
</td>
<td>
<p>
<b>
Jane Doe, Ph.D.
</b>
<br />
Professor
<br />
School of Public Health
<br />
</p>
</td>
<td>
<p>
Sciences Bldg
<br />
jd [at]
website.com
<br />
</a>
</p>
</td>
</tr>
</table>
Here is my code:
t = soup.findAll("table",id="people")
for table in t:
    rows = table.findAll("tr")
    for tr in rows:
        cols = tr.findAll("td")
        for td in cols:
            print(str(td.find(text=True))) # tried also print(td.find(text=True))
            print(",")
        print("\n")
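This prints only the commas and none of the actual text. When I use print(td) instead,
I do see the information I need, but with all of the HTML tags included. Can someone point me in the right direction? I only want to extract the cell contents.
Cheers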
Answer 0 (score: 0)
Maybe you're looking for something like this:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<table id=people><tr><td>x<a>y</a>z</td><td>x<a>y</a>z</td></tr></table>")
t = soup.findAll("table",id="people")
for table in t:
    rows = table.findAll("tr")
    for tr in rows:
        cols = tr.findAll("td")
        print(','.join([td.text for td in cols]))
Alternatively, you can use u''.join(map(unicode, td.contents)), which keeps the inner markup,
depending on what you want to print.
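For anyone on Python 3, here is a rough sketch of the same idea with the newer bs4 package. It assumes `beautifulsoup4` is installed and uses a trimmed-down copy of the table from the question as sample data; the `cell_text` helper and the " | " separator are my own choices for illustration, not part of the library. It also shows why the original loop seemed to print nothing: the first text node inside each `<td>` is just the newline that follows the tag.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the table in the question (assumed sample data).
html = """
<table id="people">
<tr>
<td>
<p>
<b>
John Doe, Ph.D.
</b>
<br />
Associate Professor, Computer
Science
<br />
</p>
</td>
<td>
<p>
Office: Sciences Bldg.
<br />
Phone:
xxx-xxx-xxxx
<br />
</p>
</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def cell_text(td):
    """Hypothetical helper: join every text fragment in a cell
    and collapse runs of whitespace into single spaces."""
    return " ".join(td.get_text(" ", strip=True).split())


rows = []
for table in soup.find_all("table", id="people"):
    for tr in table.find_all("tr"):
        rows.append([cell_text(td) for td in tr.find_all("td")])

# find(string=True) returns only the FIRST text node of the cell,
# which here is the whitespace right after the opening <td> tag.
first_node = soup.find("td").find(string=True)
print(repr(first_node))

# The cells themselves contain commas, so a different separator is used.
for row in rows:
    print(" | ".join(row))
```

If you need the pieces of a cell separately (name, title, phone) rather than as one string, iterating over `td.stripped_strings` is probably a better starting point than `get_text`.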