Question

I got the following sample HTML table from an HTML page.


<table id="fullRecordTable" valign="bottom" cellpadding="3" cellspacing="0" class="yellow" width="100%" summary="Vollanzeige des Suchergebnises">

...

  <tr>
    <td width="25%" class='yellow'>
      <strong>Sachbegriff</strong>
    </td>
    <td class='yellow'>
      Messung
    </td>
  </tr>
  
  <tr>
    ...
  </tr>
  
  <tr>
    ...
  </tr>
  
  <tr>
    ...
  </tr>
  
  <tr>
    <td width="25%" class='yellow'>
      <strong>DDC-Notation</strong>
    </td>
    <td class='yellow'>
      530.8<br/>T1--0287<br/>542.3
    </td>
  </tr>


I am trying to print not "DDC-Notation" itself but the three values that follow it: "530.8", "T1--0287", "542.3".

My code is:

soup = BeautifulSoup(data, "html.parser")

talbes = soup.findAll('table', id='fullRecordTable').find_all('tr')

for table in talbes:
    tds = table.find_all('strong')  
    print tds.text

But it does not work even for the first one.

P.S. Sorry, this is my first post. If I have not explained my problem clearly, I will try again.

Answer 1

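Life is much easier if you debug this kind of code in an interactive environment, because you can poke around for what you need as you go.

In this case I knew you wanted to find a particular string, so I searched for that directly.

Having found it, I located its grandparent, the td element, and then that td's next sibling, which is another td.

I assigned that to a variable named td, purely for convenience, because I wasn't yet sure how to dig out the pieces you want.

Eventually I found that the children attribute yields the items you want (wrap it in list() to see them). After that it is just a matter of stripping out the HTML tags, newlines and whitespace.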

>>> import bs4
>>> HTML = open('temp.htm').read()
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> strong = soup.find_all(string='DDC-Notation')
>>> strong
['DDC-Notation']
>>> strong[0].findParent()
<strong>DDC-Notation</strong>
>>> strong[0].findParent().findParent()
<td class="yellow" width="25%">
<strong>DDC-Notation</strong>
</td>
>>> strong[0].findParent().findParent().findNextSibling()
<td class="yellow">
      530.8<br/>T1--0287<br/>542.3
    </td>
>>> td = strong[0].findParent().findParent().findNextSibling()
>>> td
<td class="yellow">
      530.8<br/>T1--0287<br/>542.3
    </td>
>>> td.children
<list_iterator object at 0x00000000035993C8>
>>> list(td.children)
['\n      530.8', <br/>, 'T1--0287', <br/>, '542.3\n    ']

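EDIT: It occurred to me this morning that this answer might be more useful to you if I provided a consolidated script. While writing it I discovered (again) that handling the items in the list takes a little more work than it appears.

When Python prints most things it converts them to strings automatically. However, the items in a list of HTML elements are element objects rather than strings, so if you want to treat them as strings you have to convert them first. Hence the line `item = str(item).strip()`, which converts the element to a string and discards the surrounding whitespace.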

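import bs4

# Read the saved page; 'temp.htm' is the same local copy used above.
with open('temp.htm') as f:
    HTML = f.read()
soup = bs4.BeautifulSoup(HTML, 'lxml')
# Find the label text, then step up to its <td> and across to the value cell.
strong = soup.find_all(string='DDC-Notation')
td = strong[0].findParent().findParent().findNextSibling()
for item in td.children:
    item = str(item).strip()
    # Skip child tags such as <br/>; keep only the text pieces.
    if item.startswith('<'):
        continue
    print(item)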

Output:

530.8
T1--0287
542.3
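
For what it's worth, the same navigation can probably be written a little more compactly. The following is only a sketch under some assumptions: the `SAMPLE_HTML` string is a hypothetical cut-down stand-in for your page, `values_after_label` is a helper name I made up, and I use the built-in `html.parser` so that lxml is not required. It relies on Beautiful Soup's `stripped_strings`, which yields only the text pieces of an element with whitespace removed, so the `<br/>` tags never need to be filtered out by hand.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page, cut down to two rows of the table.
SAMPLE_HTML = """
<table id="fullRecordTable">
  <tr>
    <td width="25%" class="yellow"><strong>Sachbegriff</strong></td>
    <td class="yellow">Messung</td>
  </tr>
  <tr>
    <td width="25%" class="yellow">
      <strong>DDC-Notation</strong>
    </td>
    <td class="yellow">
      530.8<br/>T1--0287<br/>542.3
    </td>
  </tr>
</table>
"""


def values_after_label(html, label):
    """Return the text pieces of the cell that follows the <strong> label."""
    soup = BeautifulSoup(html, "html.parser")
    # Match the <strong> tag whose only content is the label text.
    strong = soup.find("strong", string=label)
    if strong is None:
        return []
    # Step up to the label's cell, then across to the value cell.
    value_td = strong.find_parent("td").find_next_sibling("td")
    if value_td is None:
        return []
    # stripped_strings skips tags like <br/> and trims whitespace.
    return list(value_td.stripped_strings)


ddc_values = values_after_label(SAMPLE_HTML, "DDC-Notation")
for value in ddc_values:
    print(value)
```

Returning an empty list when the label is missing is just one possible design choice; raising an error may suit you better if the label should always be there.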

Python - BeautifulSoup: find the next value after a strong tag in a td class
