Question

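My local airport disgracefully blocks users without IE, and its site looks awful. I want to write a Python script that fetches the contents of the Arrivals and Departures pages every few minutes and shows them in a more readable way.

My tools of choice are mechanize, to fool the site into believing I use IE, and BeautifulSoup, to parse the page and get the flight data table.

Quite honestly, I got lost in the BeautifulSoup documentation and can't understand how to get the table (whose title I know) out of the whole document, or how to get a list of rows from that table.

Any ideas?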

Answer 1

This is not the specific code you need, just a demo of how to work with BeautifulSoup. It finds the table whose id is "Table1" and gets all of its tr elements.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(url).read()
bs = BeautifulSoup(html, 'html.parser')
table = bs.find(lambda tag: tag.name == 'table' and tag.has_attr('id') and tag['id'] == "Table1")
rows = table.find_all(lambda tag: tag.name == 'tr')
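
For a plain id match, the lambda above can probably be replaced by a keyword argument to find(). A minimal sketch, assuming bs4 with the built-in html.parser; the sample HTML and flight values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample document standing in for the downloaded page.
html = """
<table id="Other"><tr><td>ignore me</td></tr></table>
<table id="Table1">
  <tr><td>BA123</td><td>10:05</td></tr>
  <tr><td>LH456</td><td>10:20</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword arguments to find() filter on attributes, so id="Table1"
# selects the same table as the lambda version.
table = soup.find("table", id="Table1")

# find_all("tr") returns every row inside that table.
rows = table.find_all("tr")
print(len(rows))
```

The lambda form is still useful when the condition is more than a simple attribute comparison.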

Answer 2

from bs4 import BeautifulSoup

soup = BeautifulSoup(HTML, "html.parser")

# The first argument to find() is the tag name to search for.
# The second is a dict of attribute -> value pairs used to filter
# the tags that match the first argument.
table = soup.find("table", {"title": "TheTitle"})

rows = []
for row in table.find_all("tr"):
    rows.append(row)

# rows now contains each tr in the table (as a BeautifulSoup Tag object),
# and you can search them to pull out the times.
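
Pulling the cell text out of each row could look roughly like this. It is only a sketch, assuming bs4 with the built-in html.parser; the sample table, its column layout, and the name flights are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample; "TheTitle" mirrors the title used above.
html = """
<table title="TheTitle">
  <tr><th>Flight</th><th>Time</th></tr>
  <tr><td>BA123</td><td> 10:05 </td></tr>
  <tr><td>LH456</td><td> 10:20 </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"title": "TheTitle"})

flights = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        continue  # the header row has only <th> cells, so skip it
    # get_text(strip=True) drops the whitespace around each value.
    flights.append([cell.get_text(strip=True) for cell in cells])

print(flights)
```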

Answer 3

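Here is a working example for a generic <table>. (Your page is not used because loading its table data requires executing JavaScript.)

Extracting the table data from here (GDP, Gross Domestic Product, by country):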

htmltable = soup.find('table', { 'class' : 'table table-striped' })
# where the dictionary specifies unique attributes for the 'table' tag

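Below is the main function, tableDataText, which parses an HTML segment that starts with a <table> tag followed by multiple <tr> (table row) tags with inner <td> (table data) tags. It returns a list of rows, each a list of column values. Only the first row is treated as a <th> (table header) row.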

def rowgetDataText(tr, coltag='td'): # td (data) or th (header)
    cols = []
    for td in tr.find_all(coltag):
        cols.append(td.get_text(strip=True))
    return cols

def tableDataText(table):       
    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')
    if headerow: # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every table row
        rows.append(rowgetDataText(tr, 'td')) # data row
    return rows

Using it we get (first two rows):

list_table = tableDataText(htmltable)
list_table[:2]

[['Rank',
  'Name',
  "GDP (IMF '19)",
  "GDP (UN '16)",
  'GDP Per Capita',
  '2019 Population'],
 ['1',
  'United States',
  '21.41 trillion',
  '18.62 trillion',
  '$65,064',
  '329,064,917']]

This can easily be converted into a pandas.DataFrame for more advanced manipulation.

import pandas as pd

dftable = pd.DataFrame(list_table[1:], columns=list_table[0])
dftable.head(4)
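
tableDataText above reads <th> cells only from the first row. If header cells can also appear in later rows, one possible variant collects <th> and <td> together. This is a sketch under assumptions: bs4 with the built-in html.parser, and the function name table_rows and the sample HTML are made up.

```python
from bs4 import BeautifulSoup

def table_rows(table):
    """Return each <tr> as a list of cell strings, header cells included."""
    rows = []
    for tr in table.find_all("tr"):
        # Passing a list of names matches <th> and <td> in document order.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

# Made-up sample where each data row starts with a <th> cell.
html = """
<table>
  <tr><th>Flight</th><th>Time</th></tr>
  <tr><th>BA123</th><td>10:05</td></tr>
  <tr><th>LH456</th><td>10:20</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
data = table_rows(soup.find("table"))
print(data)
```

The trade-off is that this variant no longer separates the header row from the data rows, so the caller has to decide which row holds the column names.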

Answer 4

If you care, BeautifulSoup is no longer maintained, and the original maintainer suggests a transition to lxml. XPath should do the trick nicely.
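
An XPath lookup with lxml might look like the following. It is a rough sketch that assumes the lxml package is installed; the sample HTML and the "TheTitle" title value are made up for illustration.

```python
import lxml.html

# Made-up sample page; the title attribute identifies the table we want.
html = """
<html><body>
<table title="Other"><tr><td>ignore me</td></tr></table>
<table title="TheTitle">
  <tr><th>Flight</th><th>Time</th></tr>
  <tr><td>BA123</td><td>10:05</td></tr>
  <tr><td>LH456</td><td>10:20</td></tr>
</table>
</body></html>
"""

doc = lxml.html.fromstring(html)

# Select every <tr> under the table whose title attribute matches.
rows = doc.xpath('//table[@title="TheTitle"]//tr')

data = []
for tr in rows:
    # "th|td" picks the header or data cells that are direct children of the row.
    data.append([cell.text_content().strip() for cell in tr.xpath("th|td")])

print(data)
```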

BeautifulSoup: get the contents of a specific table
