BeautifulSoup - Merging multiple elements into one list

Date: 2018-12-01 00:53:33

Tags: python python-3.x beautifulsoup lxml

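I am using Beautiful Soup to parse an HTML document into Python objects, but I have run into a small problem.

I am trying to convert a table into a list of dictionaries. I want the dictionary keys to be the column headings, but the table has multiple header rows with different numbers of th elements. To get usable dictionary keys, I need to somehow merge the two header rows into a single, concatenated row.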

This is what the header rows look like (image: Source HTML).

Here is the underlying HTML:

<thead>
   <tr>
      <th></th>
      <th class="metadata platform"></th>
      <th class="wtt time borderleft" colspan="2"><abbr title="Working Timetable">WTT</abbr></th>
      <th class="gbtt time borderleft" colspan="2"><abbr title="Public Timetable (Great Britain Timetable)">GBTT</abbr></th>
      <th class="metadata line path borderleft" colspan="2">Route</th>
      <th class="metadata allowances borderleft" colspan="3">Allowances</th>
   </tr>
   <tr>
      <th>Location</th>
      <th class="metadata platform span2">Pl</th>
      <th class="wtt time span3 borderleft">Arr</th>
      <th class="wtt time span3">Dep</th>
      <th class="gbtt time span3 borderleft">Arr</th>
      <th class="gbtt time span3">Dep</th>
      <th class="metadata line span2 borderleft">Line</th>
      <th class="metadata path span2">Path</th>
      <th class="metadata allowances engineering span2 borderleft"><abbr title="Engineering allowance">Eng</abbr></th>
      <th class="metadata allowances pathing span2"><abbr title="Pathing allowance">Pth</abbr></th>
      <th class="metadata allowances performance span2"><abbr title="Performance allowance">Prf</abbr></th>
   </tr>
</thead>

Ideally, this is the output I need, so that I can use a dictionary comprehension to build the list.

['Location', 'Pl', 'WTT Arr', 'WTT Dep', 'GBTT Arr', 
 'GBTT Dep', 'Route Line', 'Route Path', 'Allowances Eng', 
 'Allowances Pth', 'Allowances Prf']

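The only way I can think of to do this is to iterate over each th element and build the headers that way. So here I end up with a list of 11 elements that takes two passes to build.

# First pass
['', '', 'WTT', 'WTT', 'GBTT', 
 'GBTT', 'Route', 'Route', 'Allowances', 
 'Allowances', 'Allowances']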

# Second pass
['Location', 'Pl', 'WTT Arr', 'WTT Dep', 'GBTT Arr', 
 'GBTT Dep', 'Route Line', 'Route Path', 'Allowances Eng', 
 'Allowances Pth', 'Allowances Prf']

While this is a workable solution, I would like to think there is a more Pythonic way.

Edit: the code used to create the dictionary keys:

from bs4 import BeautifulSoup
import requests

url = 'http://www.realtimetrains.co.uk/train/P16871/2018/12/10/advanced'

bs = BeautifulSoup(requests.get(url).content, 'lxml')
table = bs.find_all('table', class_='advanced')
headers = table[0].select('thead tr')

keys = []
for th in headers[0].findChildren('th'):
    keys.append(th.getText())
    try:
        colspan = int(th['colspan'])
        if colspan > 0:
            for i in range(0, colspan-1):
                keys.append(th.getText())
    except KeyError:
        pass

th_elements = list(headers[1].findChildren('th'))
for i in range(0, len(keys)):
    keys[i] = keys[i] + ' ' + th_elements[i].getText()
    keys[i] = keys[i].strip()

print(keys)
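
For comparison, the same two-pass idea can be sketched without a network request. This is only an illustration under stated assumptions: the `(text, colspan)` pairs in `top_row` and the labels in `bottom_row` are typed in by hand from the HTML above, and `merge_headers` is a hypothetical helper name, not part of any library.

```python
# Hypothetical input: (text, colspan) pairs for the first header row,
# copied by hand from the HTML above (a missing colspan counts as 1).
top_row = [("", 1), ("", 1), ("WTT", 2), ("GBTT", 2),
           ("Route", 2), ("Allowances", 3)]

# Text of each <th> in the second header row, also copied by hand.
bottom_row = ["Location", "Pl", "Arr", "Dep", "Arr", "Dep",
              "Line", "Path", "Eng", "Pth", "Prf"]


def merge_headers(top, bottom):
    """Repeat each top label colspan times, then join it with the label below."""
    # First pass: expand every top-row label across the columns it spans.
    expanded = [text for text, span in top for _ in range(span)]
    if len(expanded) != len(bottom):
        raise ValueError("header rows do not cover the same number of columns")
    # Second pass: pair the rows column by column and drop stray spaces.
    return [f"{upper} {lower}".strip() for upper, lower in zip(expanded, bottom)]


keys = merge_headers(top_row, bottom_row)
print(keys)
```

With Beautiful Soup, the pairs for a header row could be collected with something like `[(th.get_text(strip=True), int(th.get('colspan', 1))) for th in row.find_all('th')]`, which also removes the need for the try/except around the colspan lookup.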

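1 Answer:

Answer 0 (score: 1)

As an alternative, you can use pandas read_html (which can also use BeautifulSoup under the hood). Read the HTML into a DataFrame, flatten the column names, and then output the result as a list of dictionaries.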

import pandas as pd

df = pd.read_html('http://www.realtimetrains.co.uk/train/P16871/2018/12/10/advanced')[0]
df.columns = [' '.join([c for c in col if 'Unnamed' not in c]) 
              for col in df.columns.values]
df.to_dict(orient='records')

This gives:

[
  {
    'Location': 'Swansea [SWA]',
    'Pl': 3.0,
    'WTT Arr': nan,
    'GBTT Dep': 911.0,
    'Route Arr': nan,
    'Allowances Dep': 910.0,
    'Line': nan,
    'Path': nan,
    'Eng': nan,
    'Pth': nan,
    'Prf': nan
  }, 
  ...
]
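
To make the flattening step easier to follow, here is a small sketch of what that list comprehension does, using plain tuples instead of a real DataFrame. The column tuples and the single data row are made-up placeholders, and the `Unnamed: ...` string only imitates the kind of name pandas generates for an empty header cell; treat all of them as assumptions.

```python
# Hypothetical two-level column names, shaped like the tuples found in
# df.columns.values when a table has two header rows.
columns = [
    ("Unnamed: 0_level_0", "Location"),  # empty top-level header cell
    ("WTT", "Arr"),
    ("WTT", "Dep"),
]

# Join the levels of each column, skipping the generated placeholder names.
flat = [" ".join(part for part in col if "Unnamed" not in part)
        for col in columns]

# Placeholder data row (not real timetable data), used only to show the
# resulting list-of-dictionaries shape.
rows = [["A", "0900", "0910"]]
records = [dict(zip(flat, row)) for row in rows]

print(flat)
print(records)
```

The same comprehension is applied to `df.columns.values` in the answer; `to_dict(orient='records')` then plays the role of the `dict(zip(...))` line here.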