BeautifulSoup: extracting HTML text while excluding certain tags

Date: 2018-01-19 08:02:03

Tags: python html beautifulsoup

<tbody>
  <tr class="abc bg1">...</tr>
  <tr class="bg1">...</tr>
    <td class="no">...</td>
    <td>sampletext</td>
    <td class="title">...</td>
  <tr class="bg2">...</tr>

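This sample markup has 3 class values: 'abc bg1', 'bg1', and 'bg2'. I only want the 'bg1' and 'bg2' tags, so I used soup.select('tbody > tr.bg1 > td').

That selector returns the child 'td' tags of both the 'abc bg1' row and the 'bg1' row. How can I get the result I want? For 'bg1', I want to extract only the text, excluding the other tags. Example: sampletext <- only this

2 answers:

Answer 0 (score: 0)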

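from bs4 import BeautifulSoup

html_str = """<tbody>
  <tr class="abc bg1">...</tr>
  <tr class="bg1">...</tr>
    <td class="no">...</td>
    <td>sampletext</td>
    <td class="title">...</td>
  <tr class="bg2">...</tr></tbody>"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html_str, 'html.parser')
# Both the 'abc bg1' row and the 'bg1' row match; index 1 is the row whose only class is 'bg1'.
bg1 = soup.findAll('tr', attrs={'class': 'bg1'})[1].text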

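If you use .findAll (spelled find_all in current Beautiful Soup), it finds every tag that has that class name, including tags that also carry other classes. It gives you a list; then just index into the list for the tr you want.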
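The class matching described above can be sketched as follows. This is a minimal illustration, not the answerer's code: the row texts "first", "second", "third" and the helper name only_bg1 are made up for the example. It shows that a single-class search also matches the 'abc bg1' row, and one possible way to require the class list to be exactly 'bg1'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the row texts are placeholders for illustration.
html_str = """<tbody>
  <tr class="abc bg1">first</tr>
  <tr class="bg1">second</tr>
  <tr class="bg2">third</tr>
</tbody>"""

soup = BeautifulSoup(html_str, "html.parser")

# Searching by one class matches every tag that carries that class,
# including tags with extra classes such as "abc bg1".
loose_texts = [tr.get_text() for tr in soup.find_all("tr", class_="bg1")]


def only_bg1(tag):
    """Hypothetical predicate: a tr whose class list is exactly ["bg1"]."""
    return tag.name == "tr" and tag.get("class") == ["bg1"]


# Passing a function to find_all keeps only the tags it returns True for.
strict_texts = [tr.get_text() for tr in soup.find_all(only_bg1)]

print(loose_texts)
print(strict_texts)
```

With the predicate, there is no need to remember which list index belongs to the wanted row.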

Update: if you want an element inside bg1, call another .find. Like this: sample_text = soup.findAll('td')[1].text  # this gives you "sampletext".
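Indexing with [1] depends on the cell order. A sketch that avoids the index is shown here, under the assumption that the td cells are nested inside the 'bg1' row (the markup in the question leaves this ambiguous). The cell contents "skip me", "1", and "a title" are placeholders, not taken from the question.

```python
from bs4 import BeautifulSoup

# Assumed layout: the td cells sit inside the "bg1" row.
# Cell contents other than "sampletext" are placeholders.
html_str = """<table><tbody>
  <tr class="abc bg1"><td>skip me</td></tr>
  <tr class="bg1">
    <td class="no">1</td>
    <td>sampletext</td>
    <td class="title">a title</td>
  </tr>
</tbody></table>"""

soup = BeautifulSoup(html_str, "html.parser")

# Pick the row whose class list is exactly ["bg1"], skipping "abc bg1".
row = soup.find(lambda tag: tag.name == "tr" and tag.get("class") == ["bg1"])

# Keep only the cells that have no class attribute at all.
plain_cells = [
    td.get_text(strip=True)
    for td in row.find_all("td")
    if not td.has_attr("class")
]

print(plain_cells)
```

This keeps working if more classed cells are added to the row, as long as the wanted cell stays unclassed.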

Answer 1 (score: 0)

Here is a way to identify all tags that have 'bg1' OR 'bg2' but not 'abc':

from bs4 import BeautifulSoup

html_doc = '''<tbody>
    <tr class="abc bg1">...</tr>
    <tr class="bg1">...</tr>
        <td class="no">...</td>
        <td>sampletext</td>
        <td class="title">...</td>
    <tr class="bg2">...</tr>
</tbody>'''

# The parser name must be passed as a string.
soup = BeautifulSoup(html_doc, 'html.parser')


# We can look for all tags that are "tr" tags.
for tag in soup.find_all('tr'):

    # Each tag has attributes. tag.get('class', []) returns the list of
    #     class names, or an empty list when the tag has no class attribute.
    if 'abc' in tag.get('class', []):
        continue
    else:
        print(tag)

<tr class="bg1">...</tr>
<tr class="bg2">...</tr>
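The same filter can probably be written as a CSS selector, which is closer to the soup.select call in the question. The sketch below assumes Beautiful Soup 4.7 or later, where select is backed by the Soup Sieve package and supports the :not() pseudo-class. The td cells are left out to keep it short.

```python
from bs4 import BeautifulSoup

html_doc = """<tbody>
    <tr class="abc bg1">...</tr>
    <tr class="bg1">...</tr>
    <tr class="bg2">...</tr>
</tbody>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Rows with class bg1 or bg2, excluding any row that also has class abc.
# Assumes a Beautiful Soup version whose select() understands :not().
rows = soup.select("tbody > tr.bg1:not(.abc), tbody > tr.bg2:not(.abc)")

# Each row's class attribute is a list of class names.
row_classes = [tr["class"] for tr in rows]
print(row_classes)
```

Appending "> td" to each half of the selector would return the child cells instead, provided the cells are nested inside the rows.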