Extracting headings from various tags using Beautiful Soup

Time: 2019-06-25 10:32:04

Tags: python html python-3.x beautifulsoup scrapy

How can I use Beautiful Soup to extract the table headings for both kinds of table in the HTML below?

<body>
    <p>some other data 1</p>
    <p>Table1 heading</p>
    <div></div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data1_00</p></td>
                <td><p>data1_01</p></td>
            </tr>
            <tr>
                <td><p>data1_10</p></td>
                <td><p>data1_11</p></td>
            </tr>
        </tbody></table></div>
    </div>

    <br><br>

    <div>some other data 2</div>
    <div>Table2 heading</div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data2_00</p></td>
                <td><p>data2_01</p></td>
            </tr>
            <tr>
                <td><p>data2_10</p></td>
                <td><p>data2_11</p></td>
            </tr>
        </tbody></table></div>
    </div>
</body>

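For the first table the heading is inside a <p> tag, and for the second table the heading is inside a <div> tag. In addition, there is an empty <div> tag just above the first table.
How can I extract both table headings?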

Currently, I use table.find_previous('div') to search for the previous <div> above the current table, and the text inside it is saved as the heading.

from bs4 import BeautifulSoup
import urllib.request

# url is the address of the page being scraped (not given in the question)
htmlpage = urllib.request.urlopen(url)
page = BeautifulSoup(htmlpage, "html.parser")
all_divtables = page.find_all('table')
for table in all_divtables:
    curr_div = table
    while True:
        # Walk backwards to the nearest preceding <div>
        curr_div = curr_div.find_previous('div')
        if len(curr_div.find_all('table')) > 0:
            # Skip wrapper <div>s that themselves contain a table
            continue
        else:
            heading = curr_div.text.strip()
            print(heading)
            break

Desired output:
  Table1 heading
  Table2 heading

2 answers:

Answer 0: (score: 3)

You can use the find_previous() function with a lambda argument that selects the first preceding tag that does not contain a table and whose text is not empty:

data = '''<body>
    <p>some other data 1</p>
    <p>Table1 heading</p>
    <div></div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data1_00</p></td>
                <td><p>data1_01</p></td>
            </tr>
            <tr>
                <td><p>data1_10</p></td>
                <td><p>data1_11</p></td>
            </tr>
        </tbody></table></div>
    </div>

    <br><br>

    <div>some other data 2</div>
    <div>Table2 heading</div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data2_00</p></td>
                <td><p>data2_01</p></td>
            </tr>
            <tr>
                <td><p>data2_10</p></td>
                <td><p>data2_11</p></td>
            </tr>
        </tbody></table></div>
    </div>

    <div>some other data 3</div>
    <div>Table3 heading</div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data2_00z</p></td>
                <td><p>data2_01z</p></td>
            </tr>
            <tr>
                <td><p>data2_10z</p></td>
                <td><p>data2_11z</p></td>
            </tr>
        </tbody></table></div>
    </div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data2_00x</p></td>
                <td><p>data2_01x</p></td>
            </tr>
            <tr>
                <td><p>data2_10x</p></td>
                <td><p>data2_11x</p></td>
            </tr>
        </tbody></table></div>
    </div>

</body>'''

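from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

for table in soup.select('table'):
    # Nearest preceding tag that contains no table and has non-empty text;
    # iterating over it yields its children (here, the heading string)
    for i in table.find_previous(lambda t: not t.find('table') and t.text.strip() != ''):
        # Skip matches that lie inside an earlier table (this table has no heading)
        if i.find_parents('table'):
            continue
        print(i)
        print('*' * 80)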

Prints:

Table1 heading
********************************************************************************
Table2 heading
********************************************************************************
Table3 heading
********************************************************************************
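
The same idea can be packaged as a function that returns the headings as a list instead of printing them. This is only a sketch under a few assumptions: the names table_headings, is_heading_candidate and SAMPLE_HTML are hypothetical, the markup is a shortened version of the sample above, and the built-in 'html.parser' is used in place of 'lxml' so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the sample markup; the last table
# deliberately has no heading of its own.
SAMPLE_HTML = """<body>
<p>some other data 1</p>
<p>Table1 heading</p>
<div></div>
<div><div><table><tbody><tr><td><p>data1_00</p></td></tr></tbody></table></div></div>
<div>some other data 2</div>
<div>Table2 heading</div>
<div><div><table><tbody><tr><td><p>data2_00</p></td></tr></tbody></table></div></div>
<div><div><table><tbody><tr><td><p>data3_00</p></td></tr></tbody></table></div></div>
</body>"""


def is_heading_candidate(tag):
    """True for a tag that has visible text and does not contain a table."""
    return tag.find('table') is None and tag.get_text(strip=True) != ''


def table_headings(html):
    """Return the heading text found before each table, skipping tables with none."""
    soup = BeautifulSoup(html, 'html.parser')
    headings = []
    for table in soup.find_all('table'):
        candidate = table.find_previous(is_heading_candidate)
        # If the nearest candidate lies inside an earlier table, this table
        # has no heading of its own, so it is skipped.
        if candidate is None or candidate.find_parent('table') is not None:
            continue
        headings.append(candidate.get_text(strip=True))
    return headings


print(table_headings(SAMPLE_HTML))
```

Returning a list keeps the lookup separate from the output, and the explicit None check avoids an error on a page where no preceding tag qualifies.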

Answer 1: (score: 0)

urldata='''<body>
    <p>some other data 1</p>
    <p>Table1 heading</p>
    <div></div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data1_00</p></td>
                <td><p>data1_01</p></td>
            </tr>
            <tr>
                <td><p>data1_10</p></td>
                <td><p>data1_11</p></td>
            </tr>
        </tbody></table></div>
    </div>

    <br><br>

    <div>some other data 2</div>
    <div>Table2 heading</div>
    <div>
        <div><table width="15%"><tbody>
            <tr>
                <td><p>data2_00</p></td>
                <td><p>data2_01</p></td>
            </tr>
            <tr>
                <td><p>data2_10</p></td>
                <td><p>data2_11</p></td>
            </tr>
        </tbody></table></div>
    </div>
</body>'''

import re
from bs4 import BeautifulSoup

# Parse the urldata string defined above
soup = BeautifulSoup(urldata, 'lxml')

# Find every text node whose content contains the word "heading"
results = soup.body.find_all(string=re.compile('heading'))
for result in results:
    print(result)

Output:

Table1 heading
Table2 heading
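
Note that this approach matches on the text itself, so it only finds headings that contain the literal word "heading". The sketch below illustrates that limitation. The markup and the "Quarterly totals" heading are made up for the example, and 'html.parser' is used so that no extra parser is required.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading does not contain the word "heading".
html = """<body>
<p>Table1 heading</p>
<div><table><tr><td><p>data1_00</p></td></tr></table></div>
<div>Quarterly totals</div>
<div><table><tr><td><p>data2_00</p></td></tr></table></div>
</body>"""

soup = BeautifulSoup(html, 'html.parser')

# string= with a compiled pattern returns the text nodes the pattern matches
matches = [str(s).strip() for s in soup.body.find_all(string=re.compile('heading'))]
print(matches)
```

Only the first heading is found here, so a text-based search suits pages where the heading wording is known in advance, whereas a position-based search such as find_previous() does not depend on the wording.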