Question

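I have an HTML document, and I want to pull the tables out of it and return them as arrays. I'm picturing two functions: one that finds all the HTML tables in a document, and a second that turns an HTML table into a 2-dimensional array.

Something like this: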

htmltables = get_tables(htmldocument)
for table in htmltables:
    array=make_array(table)

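There are two catches:
1. The number of tables varies from day to day.
2. The tables have all sorts of weird extra formatting, such as bold and blink tags, thrown in at random.

Thanks!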

Answer 1

Use BeautifulSoup (I recommend 3.0.8). Finding all the tables is trivial:

import BeautifulSoup

def get_tables(htmldoc):
    soup = BeautifulSoup.BeautifulSoup(htmldoc)
    return soup.findAll('table')

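However, in Python an array is 1-dimensional and constrained to very elementary item types (integers, floats, that kind of thing). So there is no way to squeeze an HTML table into a Python array.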
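To make that constraint concrete, here is a minimal sketch using the standard library's array module. The variable names are only illustrative, and it assumes the "array" meant here is the one from that module.

```python
from array import array

# A typed, one-dimensional array of signed ints ('i' is the type code).
numbers = array('i', [1, 2, 3])
numbers.append(4)  # fine: another int

rejected = None
try:
    # The text of a table cell is a string, so the array refuses it.
    numbers.append('0,0,1')
except TypeError as exc:
    rejected = str(exc)
```

Every item must match the array's type code, which is why cell text cannot be stored in it.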

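Maybe you mean a Python list instead? That is also 1-dimensional, but anything can be an item, so you can have a list of lists (one sublist per tr tag, I imagine, holding one item per td tag).

That would give: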

def makelist(table):
  result = []
  allrows = table.findAll('tr')
  for row in allrows:
    result.append([])
    allcols = row.findAll('td')
    for col in allcols:
      thestrings = [unicode(s) for s in col.findAll(text=True)]
      thetext = ''.join(thestrings)
      result[-1].append(thetext)
  return result

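This may not be quite what you want yet (it doesn't skip HTML comments, the items of the sublists are unicode strings rather than byte strings, and so on), but it should be easy to adjust.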
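As one possible adjustment, here is a dependency-free sketch for Python 3 that uses the standard library's html.parser instead of BeautifulSoup, so it is a swapped-in technique and not the code above. The names TableCollector and get_table_lists are hypothetical. It assumes tables are not nested, and it only collects td cells, like makelist does. Comments are skipped because the parser reports them separately from text, and formatting tags such as blink are dropped while their text is kept.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table> seen so far
        self._cell = None  # text fragments of the open <td>, or None

    def _flush_cell(self):
        # Store the open cell (if any) in the current row of the current table.
        if self._cell is not None:
            self.tables[-1][-1].append(''.join(self._cell))
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag in ('table', 'tr', 'td'):
            self._flush_cell()  # tolerate a missing </td>
        if tag == 'table':
            self.tables.append([])
        elif tag == 'tr' and self.tables:
            self.tables[-1].append([])
        elif tag == 'td' and self.tables and self.tables[-1]:
            self._cell = []
        # Any other tag (b, blink, ...) is ignored; only its text is kept.

    def handle_endtag(self, tag):
        if tag in ('table', 'tr', 'td'):
            self._flush_cell()

    def handle_data(self, data):
        # Text outside a <td> is dropped; comments never reach this method.
        if self._cell is not None:
            self._cell.append(data)


def get_table_lists(htmldoc):
    """Return every table in htmldoc as a list of lists of strings."""
    parser = TableCollector()
    parser.feed(htmldoc)
    parser.close()
    return parser.tables


html = "<table><tr><td>1,<blink>0,</blink>1</td><td><b>1</b>,0,3</td></tr></table>"
print(get_table_lists(html))  # [[['1,0,1', '1,0,3']]]
```

The result is a list with one entry per table, so a number of tables that changes from day to day needs no special handling.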

Answer 2

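+1 to the asker, and another to the Python expert. I wanted to try this example using lxml and CSS selectors. Yes, this is mostly the same as Alex's example: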

from pprint import pprint  # needed for the pprint() call at the end
import lxml.html
markup = lxml.html.fromstring('''<html><body>\
<table width="600">
    <tr>
        <td width="50%">0,0,0</td>
        <td width="50%">0,0,1</td>
    </tr>
    <tr>
        <td>0,1,0</td>
        <td>0,1,1</td>
    </tr>
</table>
<table>
    <tr>
        <td>1,0,0</td>
        <td>1,<blink>0,</blink>1</td>
        <td>1,0,2</td>
        <td><bold>1</bold>,0,3</td>
    </tr>
</table>
</body></html>''')

tbl = []
rows = markup.cssselect("tr")
for row in rows:
  tbl.append(list())
  for td in row.cssselect("td"):
    tbl[-1].append(unicode(td.text_content()))

pprint(tbl)
#[[u'0,0,0', u'0,0,1'],
# [u'0,1,0', u'0,1,1'],
# [u'1,0,0', u'1,0,1', u'1,0,2', u'1,0,3']]

Answer 3

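Pandas can extract all of the tables in your HTML into a list of DataFrames right out of the box, which saves you from parsing the page yourself (reinventing the wheel). A DataFrame is a powerful kind of 2-dimensional array.

I recommend continuing to work with the data in Pandas, since it is a great tool, but you can also convert to other formats if you prefer (lists, dictionaries, csv files, etc.).

Example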

"""Extract all tables from an html file, printing and saving each to csv file."""

import pandas as pd

df_list = pd.read_html('my_file.html')

for i, df in enumerate(df_list):
    print(df)
    df.to_csv('table {}.csv'.format(i))

Getting the HTML content directly from the web instead of from a file takes only a slight modification:

import requests

html = requests.get('my_url').content
df_list = pd.read_html(html)
