Question

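Can BeautifulSoup select a table that has no attributes? The HTML contains many tables, but the data I want is in a table whose tag carries no attributes.

Here is my example. The HTML contains two tables: one holds English letters and the other holds numbers.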

from bs4 import BeautifulSoup

HTML2 = """
<table>
    <tr>
        <td class>a</td>
        <td class>b</td>
        <td class>c</td>
        <td class>d</td>
    </tr>
    <tr>
        <td class>e</td>
        <td class>f</td>
        <td class>g</td>
        <td class>h</td>
    </tr>
</table>

<table cellpadding="0">
    <tr>
        <td class>111</td>
        <td class>222</td>
        <td class>333</td>
        <td class>444</td>
    </tr>
    <tr>
        <td class>555</td>
        <td class>666</td>
        <td class>777</td>
        <td class>888</td>
    </tr>
"""
soup2 = BeautifulSoup(HTML2, 'html.parser')
f2 = soup2.select('table[cellpadding!="0"]') #<---I think the key point is here.
for div in f2:
    row = ''
    rows = div.findAll('tr')
    for row in rows:
        if(row.text.find('td') != False):
            print(row.text)

I only want the data from the "English" table, formatted like this:

a b c d
e f g h

Then I want to save it to Excel.

But I can only get at the "numbers" table. Any hints? Thanks!

Answer 1

You can use find_all and select only the tables that lack a specific attribute.

f2 = soup2.find_all('table', {'cellpadding':None})

Or, if you want to select tables that have absolutely no attributes:

f2 = [tbl for tbl in soup2.find_all('table') if not tbl.attrs]
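
The two filters above are not equivalent: one only requires that cellpadding be absent, the other requires that the table have no attributes at all. The sketch below illustrates the difference with a plain membership test on each tag's attrs dictionary. The markup, including the third table with border="1", is a hypothetical example and not part of the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first table has no attributes at all.
html = """
<table><tr><td>a</td></tr></table>
<table border="1"><tr><td>b</td></tr></table>
<table cellpadding="0"><tr><td>1</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")

# Tables that merely lack a cellpadding attribute (other attributes allowed).
no_cellpadding = [t for t in tables if "cellpadding" not in t.attrs]

# Tables that carry no attributes whatsoever.
no_attrs = [t for t in tables if not t.attrs]

print([t.td.text for t in no_cellpadding])  # ['a', 'b']
print([t.td.text for t in no_attrs])        # ['a']
```

For the HTML in the question both filters pick the same table, so either works there.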

Then you can build a list of lists from the cells of the tables in f2 (one inner list per row) and pass it to a DataFrame.

data = [ 
    [td.text for td in tr.find_all('td')] 
    for table in f2 for tr in table.find_all('tr') 
]
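
To get from that list to Excel, something like the following could work. It is a minimal sketch that assumes pandas is installed and that an Excel engine such as openpyxl is available for writing .xlsx files. The helper name save_to_excel and the file name in the usage note are made up for illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Trimmed-down version of the question's markup.
html = """
<table>
    <tr><td>a</td><td>b</td><td>c</td><td>d</td></tr>
    <tr><td>e</td><td>f</td><td>g</td><td>h</td></tr>
</table>
<table cellpadding="0">
    <tr><td>111</td><td>222</td><td>333</td><td>444</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the tables that have no attributes.
f2 = [tbl for tbl in soup.find_all("table") if not tbl.attrs]

# One inner list per table row, one string per cell.
data = [
    [td.text for td in tr.find_all("td")]
    for table in f2
    for tr in table.find_all("tr")
]

df = pd.DataFrame(data)
print(df.to_string(index=False, header=False))


def save_to_excel(frame, path):
    """Write the frame to an .xlsx file without index or header rows.

    Hypothetical helper; needs an Excel engine such as openpyxl.
    """
    frame.to_excel(path, index=False, header=False)
```

Calling save_to_excel(df, "english_table.xlsx") would then write the two rows to a workbook.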

Answer 2

You can use the has_attr method to test whether a table has the cellpadding attribute:

soup2 = BeautifulSoup(HTML2, 'html.parser')
f2 = soup2.find_all('table')
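for table in f2:
    # Skip any table that carries a cellpadding attribute.
    if not table.has_attr('cellpadding'):
        for row in table.find_all('tr'):
            # Join the cell texts so each row prints as "a b c d".
            print(' '.join(td.text for td in row.find_all('td')))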
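
Since the question tried to do this with select, the same "has no cellpadding" test can probably also be written as a CSS selector. This is only a sketch and assumes a Beautiful Soup version whose select is backed by the soupsieve package (4.7 or later), which understands :not() combined with an attribute selector. The markup is a shortened version of the question's.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup.
html = """
<table>
    <tr><td>a</td><td>b</td><td>c</td><td>d</td></tr>
    <tr><td>e</td><td>f</td><td>g</td><td>h</td></tr>
</table>
<table cellpadding="0">
    <tr><td>111</td><td>222</td><td>333</td><td>444</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Match every <table> that does not have a cellpadding attribute.
tables = soup.select("table:not([cellpadding])")

# Build one space-separated line per row.
lines = [
    " ".join(td.text for td in tr.find_all("td"))
    for table in tables
    for tr in table.find_all("tr")
]

for line in lines:
    print(line)
# a b c d
# e f g h
```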
