Question

I have the following HTML, which I am scraping with BeautifulSoup.

<table>
  <tbody>
    <tr>

      <td style="vertical-align: top; text-align: center">
        North America
        <hr>
        USA
        <hr>
        <font color="#990000">NYC</font>
      </td>

   </tr>
  </tbody>
</table>

I want to extract the values and structure them like this:

{continent: 'North America', country: 'USA', city: 'NYC'}

I am using the following approach:

table = soup.find('table')
table.tbody.tr.td.text

It gives me this:

u'\nNorth America\n\nUSA\n\nNYC\n'

This output is usable, but I am looking for a better approach.

Answer 1

BeautifulSoup can use lxml as its underlying parser, so you can also use lxml directly:

This code works even if you remove the <hr> tags.

from lxml import html

text = """
<table>
  <tbody>
    <tr>

      <td style="vertical-align: top; text-align: center">
        North America
        <hr>
        USA
        <hr>
        <font color="#990000">NYC</font>
      </td>

   </tr>
  </tbody>
</table>
"""

tree = html.fromstring(text)

# Collect every non-empty, stripped line of text found inside the table cells.
results = []
for item in tree.xpath('//tr/td//text()'):
    results.extend(line.strip() for line in item.split('\n') if line.strip())

print(results)

Output:

['North America', 'USA', 'NYC']

If you just want a dictionary:

print(dict(continent=results[0], country=results[1], city=results[2]))

Output:

{'continent': 'North America', 'country': 'USA', 'city': 'NYC'}
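
Indexing results[0], results[1] and results[2] assumes the cell always yields exactly three values in this order. A minimal sketch of the same step that checks this assumption before building the dictionary is shown below; build_record is a hypothetical helper name, not part of lxml or BeautifulSoup.

```python
def build_record(values, keys=("continent", "country", "city")):
    """Pair each key with the value at the same position.

    Hypothetical helper: raises instead of silently mislabeling values
    when the cell does not contain the expected number of entries.
    """
    if len(values) != len(keys):
        raise ValueError(
            "expected %d values, got %d: %r" % (len(keys), len(values), values)
        )
    return dict(zip(keys, values))


# The list produced by the lxml code above.
results = ['North America', 'USA', 'NYC']
record = build_record(results)
print(record)
```

If a cell is missing a value, this raises a ValueError rather than returning a dictionary with shifted fields.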

Answer 2

First, install lxml and pass it to BeautifulSoup as the parser: BeautifulSoup(data, 'lxml'). This lets the parser fix up the improperly closed <hr> tags. After that, everything should work:

from bs4 import BeautifulSoup


data = '''<table>
  <tbody>
    <tr>

      <td style="vertical-align: top; text-align: center">
        North America
        <hr>
        USA
        <hr>
        <font color="#990000">NYC</font>
      </td>

   </tr>
  </tbody>
</table>'''

soup = BeautifulSoup(data, 'lxml')

contents = soup.table.tbody.tr.td.contents
res = {
    'continent': contents[0].strip(), 
    'country': contents[2].strip(), 
    'city': contents[5].text.strip()
}

print(res)
# {'continent': 'North America', 'country': 'USA', 'city': 'NYC'}
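
The indexes 0, 2 and 5 above depend on the exact whitespace and tag layout inside the cell. As a rough, dependency-free sketch of the same idea (collect every non-empty text chunk in the cell, whatever tags separate the chunks), the following uses only the standard library's html.parser. CellTextCollector and extract_cell_text are hypothetical names, and the code assumes the three values always appear in the order continent, country, city.

```python
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Collect the non-empty text chunks found inside <td> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while the parser is inside a <td>
        self.chunks = []  # stripped text chunks, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only text that sits inside a cell and is not pure whitespace.
        if self.depth and text:
            self.chunks.append(text)


def extract_cell_text(markup):
    """Return the stripped text chunks of all <td> cells in the markup."""
    parser = CellTextCollector()
    parser.feed(markup)
    parser.close()
    return parser.chunks


data = '''<table>
  <tbody>
    <tr>
      <td style="vertical-align: top; text-align: center">
        North America
        <hr>
        USA
        <hr>
        <font color="#990000">NYC</font>
      </td>
    </tr>
  </tbody>
</table>'''

values = extract_cell_text(data)
res = dict(zip(("continent", "country", "city"), values))
print(res)
```

Because the values are gathered by position in the text stream rather than by index into the node list, adding or removing whitespace or <hr> tags between them does not change the result.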

What is a better way to extract the values between <hr /> tags using BeautifulSoup?
