Parsing HTML with Python and BeautifulSoup - getting text inside and outside <a> tags

Time: 2013-12-06 21:59:30

Tags: python html python-2.7 web-scraping beautifulsoup

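My HTML contains a number of <a> tags, each followed by text that sits outside those tags. The text I am trying to get is outside the <a> tags but still inside the enclosing tag, so I guess it is just part of that tag's content. But if I try to get the text of the enclosing tag (with td.text or something similar), I get everything at once, including the text inside all the <a> tags.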

    <td align="left">
     <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
      Garcia, Leury
     </a>
     SS CHW - Traded from Royal Disappointments
     <br>
      <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1813191">
       Almonte, Abraham
      </a>
      OF SEA - Traded from Royal Disappointments
      <br>
       <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2046044">
        Pillar, Kevin
       </a>
       OF TOR - Traded from Royal Disappointments
       <br>
        <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1666824">
         Sierra, Moises
        </a>
        LF TOR - Traded from Royal Disappointments
        <br>
         <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
          Paulino, Felipe
         </a>
         SP KC
         <span title="Felipe Paulino off 60-day DL">
          <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
           <img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
          </a>
         </span>
         - Traded from Royal Disappointments
        </br>
       </br>
      </br>
     </br>
    </td>

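Basically I want, as separate values, the text inside each <a> tag followed by the text that comes after it outside the tag. So the final result would be: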

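Garcia, Leury

SS CHW - Traded from Royal Disappointments

Almonte, Abraham

OF SEA - Traded from Royal Disappointments

Pillar, Kevin

OF TOR - Traded from Royal Disappointments

Sierra, Moises

LF TOR - Traded from Royal Disappointments

Paulino, Felipe

SP KC - Traded from Royal Disappointments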

So far I only have code that gets the text from the a tags:

        pl = psoup.findAll('a',{'class': 'playerLink'})
        for a in pl:          
            print a.text

I really don't know how to get the rest.

2 answers:

Answer 0: (score: 2)

How about calling get_text on psoup?

(Pdb) print soup.get_text()


      Garcia, Leury

     SS CHW - Traded from Royal Disappointments


       Almonte, Abraham

      OF SEA - Traded from Royal Disappointments


        Pillar, Kevin

       OF TOR - Traded from Royal Disappointments


         Sierra, Moises

        LF TOR - Traded from Royal Disappointments


          Paulino, Felipe

         SP KC





         - Traded from Royal Disappointments
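
If the values are needed one by one rather than as a single block of text, the stripped_strings generator may be a closer fit, since it yields each text node with the surrounding whitespace removed. The following is only a minimal sketch: it assumes bs4 on Python 3 with the built-in html.parser, and the markup is an abbreviated copy of the question's (two players only).

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the question's markup (assumption: two players are
# enough to show the behaviour).
html = """
<td align="left">
 <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
  Garcia, Leury
 </a>
 SS CHW - Traded from Royal Disappointments
 <br>
 <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
  Paulino, Felipe
 </a>
 SP KC
 <span title="Felipe Paulino off 60-day DL">
  <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
   <img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
  </a>
 </span>
 - Traded from Royal Disappointments
</td>
"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings walks every text node in document order, strips it,
# and skips nodes that contain only whitespace.
values = list(soup.stripped_strings)

for value in values:
    print(value)
```

Note that for the last player the text around the <span> comes back as two separate values ("SP KC" and "- Traded from Royal Disappointments"), so it would still need to be joined if a single line per player is wanted.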

Answer 1: (score: 2)

You can use the Tag.next attribute (an alias for Tag.next_element):

    for a in psoup('a', {'class': 'playerLink'}):
        print(a.text)       # text inside the link
        print(a.next.next)  # second element after the link: the text following </a>

In effect, each "outside" text is the second element after the link (the first element is the link's own anchor text).
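
The two-step next lookup depends on each link containing exactly one text node, which does not hold for the icon-only link inside the <span>. A variation that may be more forgiving is to walk next_siblings and collect text until the next <br>. This is a sketch under assumptions, not a drop-in replacement: it uses bs4 on Python 3 with html.parser, an abbreviated copy of the question's markup, and a hypothetical helper named player_rows.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Abbreviated copy of the question's markup (assumption: two players).
html = """
<td align="left">
 <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/1740935">
  Garcia, Leury
 </a>
 SS CHW - Traded from Royal Disappointments
 <br>
 <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599">
  Paulino, Felipe
 </a>
 SP KC
 <span title="Felipe Paulino off 60-day DL">
  <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/580599" subtab="Update">
   <img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
  </a>
 </span>
 - Traded from Royal Disappointments
</td>
"""


def player_rows(soup):
    """Return (name, trailing text) pairs; hypothetical helper for illustration."""
    rows = []
    for link in soup.find_all("a", class_="playerLink"):
        name = link.get_text(strip=True)
        if not name:
            # Icon-only links carry no player name, so skip them.
            continue
        parts = []
        for sibling in link.next_siblings:
            if isinstance(sibling, Tag) and sibling.name == "br":
                break  # a <br> ends this player's line
            if isinstance(sibling, NavigableString):
                text = sibling.strip()
                if text:
                    parts.append(text)
        rows.append((name, " ".join(parts)))
    return rows


rows = player_rows(BeautifulSoup(html, "html.parser"))
for name, details in rows:
    print(name)
    print(details)
```

Only direct text siblings of the link are collected, so the <span> is passed over and the pieces on either side of it are joined into one value per player.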