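Convert an HTML table to a JSON object in Python

Asked: 2017-01-11 10:25:23

Tags: python html json dictionary python-2.x

I am trying to convert an HTML table to a JSON object and write it to a file. The data comes from a printer's setup page.

Here is my code: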

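import json
import urllib2

from bs4 import BeautifulSoup

html_data = urllib2.urlopen('http://192.168.2.198/sys_count.html')
soup = BeautifulSoup(html_data, "lxml")

# Each "row" here is a whole <table class="matrix">, so every inner list
# holds all of that table's <td> cell texts flattened together.
table_data = [[cell.text for cell in row("td")]
              for row in soup.body.find_all('table', attrs={'class': 'matrix'})]

with open('/home/abc/Desktop/JsonData.txt', 'w') as outfile:
    # sort_keys expects a boolean; the string 'true' only worked because it is truthy.
    json.dump(table_data, outfile, sort_keys=True, indent=4,
              separators=(',', ':'), ensure_ascii=False)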

The output is:

[
    [
        "Black & White",
        "79555",
        "Full Colour",
        "0"
    ],
    [
        "Copy",
        "30697",
        "Printer",
        "48798",
        "Others",
        "60",
        "Scan Send",
        "Black & White",
        "648",
        "Full Colour",
        "747"
    ],
    [
        "Document Feeder",
        "11709",
        "Duplex",
        "13799"
    ]
]

I would like the output to be:

{
    {
        "Black & White":        "79555",
        "Full Colour":        "0"
    },

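and so on for the other tables...

I have tried many approaches, but I get a TypeError when I try to convert the lists to dicts. I need help.

FYI, I am using Python 2.7, in case that helps.

I have added a picture of the printer page for more detail.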

Snapshot of printer setup page

3 answers:

Answer 0 (score: 2)

You can try:

table_data = [
    [
        "Black & White",
        "79555",
        "Full Colour",
        "0"
    ],
    [
        "Copy",
        "30697",
        "Printer",
        "48798",
        "Others",
        "60",
        "Scan Send",
        "Black & White",
        "648",
        "Full Colour",
        "747"
    ],
    [
        "Document Feeder",
        "11709",
        "Duplex",
        "13799"
    ]
]
final_data = []
for data in table_data:
    # Pair each label (even index) with the value that follows it (odd index).
    d = dict(zip(data[::2], data[1::2]))
    final_data.append(d)


print(final_data)

Output:

[{'Black & White': '79555', 'Full Colour': '0'}, {'Others': '60', 'Printer': '48798', '648': 'Full Colour', 'Scan Send': 'Black & White', 'Copy': '30697'}, {'Document Feeder': '11709', 'Duplex': '13799'}]
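
The question also asks for the result to be written to a file. A minimal sketch of that step, assuming the same slice-and-zip pairing and using only the two well-formed rows; the output path in the temp directory is a hypothetical stand-in for the Desktop path in the question:

```python
import json
import os
import tempfile

table_data = [
    ["Black & White", "79555", "Full Colour", "0"],
    ["Document Feeder", "11709", "Duplex", "13799"],
]

# Pair each label (even index) with the value that follows it (odd index).
final_data = [dict(zip(row[::2], row[1::2])) for row in table_data]

# Hypothetical output location, chosen so the sketch runs anywhere.
out_path = os.path.join(tempfile.gettempdir(), "JsonData.txt")

with open(out_path, "w") as outfile:
    json.dump(final_data, outfile, sort_keys=True, indent=4)
```

Reading the file back with json.load should give the same list of dicts.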

Answer 1 (score: 1)

There may be a nicer version that avoids range(len()).

table_data = [
    [
        "Black & White",
        "79555",
        "Full Colour",
        "0"
    ],
    [
        "Copy",
        "30697",
        "Printer",
        "48798",
        "Others",
        "60",
        "Scan Send",
        "Black & White",
        "648",
        "Full Colour",
        "747"
    ],
    [
        "Document Feeder",
        "11709",
        "Duplex",
        "13799"
    ]
]

result = []

for row in table_data:
    d = dict()
    for i in range(0, len(row)-1, 2):
        d[row[i]] = row[i+1]
    result.append(d)

Result:

[ 
  {
    'Black & White': '79555', 
    'Full Colour': '0'
  },
  {
    '648': 'Full Colour',
    'Copy': '30697',
    'Others': '60',
    'Printer': '48798',
    'Scan Send': 'Black & White'
  },
  {
    'Document Feeder': '11709', 
    'Duplex': '13799'
  }
]

But some of the data appears to be misaligned, so it now produces '648': 'Full Colour', which looks odd. You may have to remove "Scan Send" from the data first.
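
A rough sketch of that idea, assuming "Scan Send" is a section heading with no value of its own; the `headings` set is a name introduced here for illustration:

```python
table_data = [
    ["Black & White", "79555", "Full Colour", "0"],
    ["Copy", "30697", "Printer", "48798", "Others", "60",
     "Scan Send", "Black & White", "648", "Full Colour", "747"],
    ["Document Feeder", "11709", "Duplex", "13799"],
]

# Assumption: these labels are headings that carry no value themselves.
headings = {"Scan Send"}

result = []
for row in table_data:
    # Drop the headings so labels and values alternate again.
    cells = [cell for cell in row if cell not in headings]
    result.append(dict(zip(cells[::2], cells[1::2])))
```

Note that this flattens the row: the second dict no longer records that the last two counts belong to "Scan Send".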

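Answer 2 (score: 0)

As I said in the comments, this task is tricky because the way you extract the table data loses important structural information from the original HTML table.

But anyway... here is a solution that handles "Scan Send". If other keys, like "Scan Send", introduce a sub-dictionary, you can add them to the special_keys set.

list_to_dict recursively builds a dictionary from a row list. First, it creates an iterator object from row. We take the next item from the iterator and assume it is a key. If it is a special key, we recurse on the rest of the row and use the returned dict as that key's value. Otherwise, we just take the next string from the iterator and use it as the value for the current key.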

import json

table_data = [
    ["Black & White", "79555", "Full Colour", "0"],
    ["Copy", "30697", "Printer", "48798", "Others", "60", 
        "Scan Send", "Black & White", "648", "Full Colour", "747"],
    ["Document Feeder", "11709", "Duplex", "13799"]
]

special_keys = {"Scan Send"}

def list_to_dict(row):
    d = {}
    it = iter(row)
    for s in it:
        if s in special_keys:
            v = list_to_dict(it)
        else:
            v = next(it)
        d[s] = v
    return d

table_dicts = [list_to_dict(row) for row in table_data]
print(json.dumps(table_dicts, sort_keys=True, indent = 4))

Output

[
    {
        "Black & White": "79555",
        "Full Colour": "0"
    },
    {
        "Copy": "30697",
        "Others": "60",
        "Printer": "48798",
        "Scan Send": {
            "Black & White": "648",
            "Full Colour": "747"
        }
    },
    {
        "Document Feeder": "11709",
        "Duplex": "13799"
    }
]

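This strategy only works if we can guarantee that all items after a special key in a given row belong in that special key's sub-dictionary. If that is not the case, we need a different strategy...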
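
To illustrate that caveat, here is a small sketch with a made-up row in which an unrelated "Fax" counter follows the "Scan Send" items. The row and the "Fax" label are hypothetical and do not come from the printer page:

```python
special_keys = {"Scan Send"}

def list_to_dict(row):
    d = {}
    it = iter(row)
    for s in it:
        if s in special_keys:
            # The recursive call consumes everything left in the iterator.
            v = list_to_dict(it)
        else:
            v = next(it)
        d[s] = v
    return d

# Hypothetical row: "Fax" is meant to be a top-level counter.
row = ["Copy", "30697", "Scan Send", "Black & White", "648", "Fax", "12"]
nested = list_to_dict(row)
```

Here "Fax" ends up inside the "Scan Send" sub-dictionary rather than next to "Copy", because the list alone gives the function no way to tell where the sub-dictionary should end.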

FWIW, here is a dict comprehension version of list_to_dict:

def list_to_dict(row):
    it = iter(row)
    return {s: list_to_dict(it) if s in special_keys else next(it) for s in it}
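
On the point about lost structure: one option is to keep the row boundaries while extracting, so each label stays next to its own value. The following is only a sketch under strong assumptions. The markup in SAMPLE_HTML is invented, since the real printer page is not shown, and it assumes one label cell and one value cell per row. It uses the standard library's HTMLParser instead of BeautifulSoup purely to avoid a third-party dependency:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7

# Hypothetical markup; the real page layout may differ.
SAMPLE_HTML = """
<table class="matrix">
  <tr><td>Copy</td><td>30697</td></tr>
  <tr><td>Printer</td><td>48798</td></tr>
</table>
<table class="matrix">
  <tr><td>Document Feeder</td><td>11709</td></tr>
  <tr><td>Duplex</td><td>13799</td></tr>
</table>
"""

class TableRows(HTMLParser):
    """Collect cell text as tables -> rows -> cells."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tables = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag == "td" and self.tables and self.tables[-1]:
            self.in_cell = True
            self.tables[-1][-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only text inside a <td> is kept; whitespace between tags is ignored.
        if self.in_cell:
            self.tables[-1][-1][-1] += data

parser = TableRows()
parser.feed(SAMPLE_HTML)

# One dict per table, built from rows that have exactly two cells.
table_dicts = [
    {row[0].strip(): row[1].strip() for row in table if len(row) == 2}
    for table in parser.tables
]
```

With row boundaries preserved, a heading row such as "Scan Send" would show up as a row with a different number of cells, which can then be handled explicitly instead of shifting every later pair.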