How do I get the content of a td, including its tags, in BeautifulSoup?

Time: 2018-03-15 22:37:45

Tags: python beautifulsoup

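Here is my scenario: for each td inside the tr, I want to get its child tags together with their content. I can get the text content but not the tags, because there are too many nested elements.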

The result should be:

  1. The p tag and its content
  2. The table element

HTML:

    <table>
        <tr>
    
            <td>
            <!-- first element -->
                <p> MY TEXT </p>
            <!-- end element -->
            </td>
    
            <td>
            <!-- second element -->
                <table>
                    <tbody>
                        <tr>
                            <td>
                                <p> MY TEXT </p>
                            </td>
                            <td>
                                <p> MY TEXT </p>
                            </td>
                        </tr>
                        <tr>
                            <td>
                                <p> MY TEXT </p>
                            </td>
                        </tr>
                    </tbody>
                </table>
            <!-- end element -->
            </td>
    
        </tr>
    </table>
    

1 answer:

Answer 0 (score: 1)

Code:

from bs4 import BeautifulSoup

html = '''
<table>
    <tr>
        <td>
        <!-- first element -->
            <p> MY TEXT </p>
        <!-- end element -->
        </td>
        <td>
        <!-- second element -->
            <table>
                <tbody>
                    <tr>
                        <td>
                            <p> MY TEXT </p>
                        </td>
                        <td>
                            <p> MY TEXT </p>
                        </td>
                    </tr>
                    <tr>
                        <td>
                            <p> MY TEXT </p>
                        </td>
                    </tr>
                </tbody>
            </table>
        <!-- end element -->
        </td>
    </tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
print("The <p> tag with its content:")
print(soup.find_all('p'))
print("\nThe <table> element:")
print(soup.find('table').prettify())

Output:

The <p> tag with its content:
[<p> MY TEXT </p>, <p> MY TEXT </p>, <p> MY TEXT </p>, <p> MY TEXT </p>]

The <table> element:
<table>
 <tr>
  <td>
   <!-- first element -->
   <p>
    MY TEXT
   </p>
   <!-- end element -->
  </td>
  <td>
   <!-- second element -->
   <table>
    <tbody>
     <tr>
      <td>
       <p>
        MY TEXT
       </p>
      </td>
      <td>
       <p>
        MY TEXT
       </p>
      </td>
     </tr>
     <tr>
      <td>
       <p>
        MY TEXT
       </p>
      </td>
     </tr>
    </tbody>
   </table>
   <!-- end element -->
  </td>
 </tr>
</table>
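
Note that `soup.find_all('p')` returns every p in the document, including those inside the nested table, and `soup.find('table')` returns the outer table. If the goal is what the question lists (the p from the first td and the table from the second td), one possible approach is to restrict the search to direct children with `recursive=False`. The following is only a sketch under that reading of the question: the HTML is a shortened version of the one above, and the names `outer_row`, `outer_cells`, and `cell_children` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened version of the HTML from the question (assumption: same nesting).
html = """
<table>
  <tr>
    <td>
      <!-- first element -->
      <p> MY TEXT </p>
    </td>
    <td>
      <!-- second element -->
      <table>
        <tbody>
          <tr><td><p> MY TEXT </p></td></tr>
        </tbody>
      </table>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The first <tr> in document order is the outer row.
outer_row = soup.find("table").find("tr")

# recursive=False keeps only the direct <td> children of that row,
# so cells of the nested table are not mixed in.
outer_cells = outer_row.find_all("td", recursive=False)

# For each outer cell, collect its direct child tags.
# find_all(True) matches any tag; comments and whitespace are skipped.
cell_children = [cell.find_all(True, recursive=False) for cell in outer_cells]

for index, children in enumerate(cell_children, start=1):
    print(f"td {index}:")
    for child in children:
        # str(child) gives the tag together with its content.
        print(str(child))
```

If the raw inner HTML of a cell is wanted as one string (comments included), `outer_cells[0].decode_contents()` should return it; with the `lxml` or `html5lib` parsers the tree may differ (for example, an inserted tbody), so the lookup of the outer row might need adjusting.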