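My HTML contains many <a> tags, followed by text that sits outside those tags. The text I'm trying to get is outside the <a> tags but inside the first <td> tag, which I guess just makes it part of the <td> tag. But if I try to get the text of the <td> tag (td.text or something similar), it also gives me the text of every <a> tag along with everything else in the <td>.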
<td align="left">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
Garcia, Leury
</a>
SS CHW - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1813191">
Almonte, Abraham
</a>
OF SEA - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2046044">
Pillar, Kevin
</a>
OF TOR - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1666824">
Sierra, Moises
</a>
LF TOR - Traded from Royal Disappointments
<br>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
Paulino, Felipe
</a>
SP KC
<span title="Felipe Paulino off 60-day DL">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
<img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
</a>
</span>
- Traded from Royal Disappointments
</br>
</br>
</br>
</br>
</td>
Basically, I want (as separate values) the text inside each <a> tag, followed by the text outside it. So the end result would be:
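Garcia, Leury
SS CHW - Traded from Royal Disappointments
Almonte, Abraham
OF SEA - Traded from Royal Disappointments
Pillar, Kevin
OF TOR - Traded from Royal Disappointments
Sierra, Moises
LF TOR - Traded from Royal Disappointments
Paulino, Felipe
SP KC - Traded from Royal Disappointments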
So far I only have code that gets the text from the <a> tags:
# Collect every player link and print its anchor text
pl = psoup.findAll('a', {'class': 'playerLink'})
for a in pl:
    print(a.text)
I really don't know how to handle the rest.
Answer 0 (score: 2)
How about calling get_text on psoup?
(Pdb) print soup.get_text()
Garcia, Leury
SS CHW - Traded from Royal Disappointments
Almonte, Abraham
OF SEA - Traded from Royal Disappointments
Pillar, Kevin
OF TOR - Traded from Royal Disappointments
Sierra, Moises
LF TOR - Traded from Royal Disappointments
Paulino, Felipe
SP KC
- Traded from Royal Disappointments
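If the pieces are needed as separate values rather than one block of text, BeautifulSoup's `stripped_strings` generator may be a closer fit than `get_text`. The sketch below is only an illustration: it assumes a trimmed two-player copy of the question's markup, the stdlib `html.parser` backend, and a hypothetical variable name `values`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (two players only, for brevity).
html = """
<td align="left">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
Garcia, Leury
</a>
SS CHW - Traded from Royal Disappointments
<br/>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1813191">
Almonte, Abraham
</a>
OF SEA - Traded from Royal Disappointments
<br/>
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each text node with surrounding whitespace
# removed and skips nodes that contain only whitespace.
values = list(soup.find("td").stripped_strings)

for value in values:
    print(value)
```

Note that in the Paulino row of the full markup, the `<span>` splits the trailing text into two nodes, so "SP KC" and "- Traded from Royal Disappointments" would come back as two separate values.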
Answer 1 (score: 2)
You can use the Tag.next attribute (an alias for Tag.next_element):
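# Calling the soup object directly is shorthand for find_all()
for a in psoup('a', {'class': 'playerLink'}):
    print(a.text)       # text inside the link
    print(a.next.next)  # text node that follows the link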
In effect, each "outer" text is the second element after the link (the first element is the link's own anchor text).
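The second-element rule holds for the simple rows, but in the Paulino row a `<span>` sits in the middle of the trailing text. One possible way to cope, sketched here under assumptions, is to walk each link's `next_siblings` until a `<br>` is reached and join the text nodes found along the way. The helper name `player_rows`, the trimmed markup, and the rule of skipping links with no text (the icon-only link) are illustrative choices, not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Trimmed copy of the question's markup: one simple row and the Paulino row.
html = """
<td align="left">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
Garcia, Leury
</a>
SS CHW - Traded from Royal Disappointments
<br/>
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
Paulino, Felipe
</a>
SP KC
<span title="Felipe Paulino off 60-day DL">
<a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
<img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
</a>
</span>
- Traded from Royal Disappointments
<br/>
</td>
"""


def player_rows(td):
    """Return (name, detail) pairs for each player link inside td."""
    rows = []
    for link in td.find_all("a", class_="playerLink"):
        name = link.get_text(strip=True)
        if not name:
            # Assumption: a link with no text is the icon-only "Update" link.
            continue
        parts = []
        for sibling in link.next_siblings:
            if isinstance(sibling, Tag) and sibling.name == "br":
                break  # a <br> ends the current player's row
            if isinstance(sibling, NavigableString):
                text = sibling.strip()
                if text:
                    parts.append(text)
        rows.append((name, " ".join(parts)))
    return rows


soup = BeautifulSoup(html, "html.parser")
rows = player_rows(soup.find("td"))

for name, detail in rows:
    print(name)
    print(detail)
```

Because the walk only looks at siblings of each link, it should also work when the parser nests later rows inside the `<br>` elements, as the prettified markup in the question suggests.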