Parsing a Python list

Date: 2014-02-21 01:40:47

Tags: python arrays json dictionary tuples

I have a JSON data file, raw.json:

{"time": 12.640, "name": "machine1", "value": 24.0}
{"time": 12.645, "name": "machine2", "value": 0.0}
{"time": 12.65002, "name": "machine3", "value": true}
{"time": 12.66505, "name": "machine4", "value": 1.345}
{"time": 12.67007, "name": "machine5", "value": 5.068}
{"time": 12.67508, "name": "machine4", "value": 1.075}
{"time": 12.6801, "name": "machine5", "value": 2.0868}
{"time": 12.6851, "name": "machine4", "value": 0.0}
{"time": 12.6901, "name": "machine5", "value": 12.633}
{"time": 12.69512, "name": "machine5", "value": 13.13}
{"time": 12.70013, "name": "machine3", "value": false}
{"time": 12.70515, "name": "machine3", "value": false}
{"time": 12.71016, "name": "machine3", "value": false}
{"time": 12.71517, "name": "machine5", "value": 131.633}

In my Python script, I can read the file line by line and build a list:

import json

data = []
with open('raw.json') as f:
    # Each line of the file is a separate JSON object
    for line in f:
        data.append(json.loads(line))

for idx, val in enumerate(data):
    time = val['time']
    name = val['name']
    value = val['value']
    data_list = idx + 1, time, name, value
    print data_list

Output:

(1, 12.64, u'machine1', 24.0)
(2, 12.645, u'machine2', 0.0)
(3, 12.65002, u'machine3', True)
(4, 12.66505, u'machine4', 1.345)
(5, 12.67007, u'machine5', 5.068)
(6, 12.67508, u'machine4', 1.075)
(7, 12.6801, u'machine5', 2.0868)
(8, 12.6851, u'machine4', 0.0)
(9, 12.6901, u'machine5', 12.633)
(10, 12.69512, u'machine5', 13.13)
(11, 12.70013, u'machine3', False)
(12, 12.70515, u'machine3', False)
(13, 12.71016, u'machine3', False)
(14, 12.71517, u'machine5', 131.633)

I want to sort this data so that each machine gets its own list (array) that I can work with, for example:

machine1 = [12.640, 24.0];
machine2 = [12.645, 0.0];
machine3 = [
12.65002, true
12.70013, false
12.70515, false
12.71016, false
];
machine4 = [
12.66505, 1.345
12.67508, 1.075
12.6851, 0.0
];

And so on. I would also like to search these tuples or lists directly to produce metadata, such as the sum / average for machine1, machine2, etc.

Sum_Machine1 = 24;
Sum_Machine2 = 0;....

2 answers:

Answer 0: (score: 2)

First solution

Here is how I would solve the problem:

import json
import collections

if __name__ == '__main__':    
    # Load file into data
    with open('raw.json') as f:
        data = [json.loads(line) for line in f]

    # Calculate count and total
    time_total = collections.defaultdict(float)
    time_count = collections.defaultdict(int)
    for row in data:
        time_count[row['name']] += 1
        time_total[row['name']] += row['time']

    # Calculate average
    time_average = {}
    for name in time_count:
        time_average[name] = time_total[name] / time_count[name]

    # Report
    for name in sorted(time_count):
        print '{:<10} {:2} {:8.2f} {:8.2f}'.format(
            name,
            time_count[name],
            time_total[name],
            time_average[name])

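Discussion

  • data is a list of dicts, each containing keys such as name, time, ...
  • I used three additional dictionaries to keep track of the count, total, and average for each machine.
  • I assumed you want the calculations to be based on the time values. If not, this is easy to change.
  • defaultdict is a good way to tally numbers. If a key has not been created yet, it is created with the default value (0 for int, 0.0 for float), which is very convenient. It is worth reading up on.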
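
If the calculations should be based on the value column instead, and each machine should get its own list as in the question, the same defaultdict idea could be sketched as follows. This is only a sketch: the names machines, value_sum, and value_avg are made up for illustration, and a few lines copied from raw.json are inlined so that the snippet runs on its own. It uses single-argument print() calls, so it should run on Python 3 as well.

```python
import json
from collections import defaultdict

# A few lines copied from raw.json in the question; in a real script,
# iterate over the open file instead of this list.
lines = [
    '{"time": 12.640, "name": "machine1", "value": 24.0}',
    '{"time": 12.66505, "name": "machine4", "value": 1.345}',
    '{"time": 12.67508, "name": "machine4", "value": 1.075}',
    '{"time": 12.6851, "name": "machine4", "value": 0.0}',
]

# Group (time, value) pairs by machine name (hypothetical variable name)
machines = defaultdict(list)
for line in lines:
    row = json.loads(line)
    machines[row['name']].append((row['time'], row['value']))

# Sum and average of the value column for each machine
value_sum = {}
value_avg = {}
for name, pairs in machines.items():
    values = [value for _, value in pairs]
    value_sum[name] = sum(values)
    value_avg[name] = value_sum[name] / len(values)

# Report: name, sum, average
for name in sorted(machines):
    print('{:<10} {:8.3f} {:8.3f}'.format(name, value_sum[name], value_avg[name]))
```

After this runs, machines['machine4'] holds the three (time, value) pairs for that machine in file order. Note that machine3 reports booleans; Python would add those up as 0 and 1, so you may want to skip such rows before summing.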

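Second solution

Here is a different approach: since your data looks like a table, why not use a database to process it? The advantage of this approach is that you do not have to do the calculations yourself.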

import json
import sqlite3

if __name__ == '__main__':
    # Create an in-memory database for calculation
    connection = sqlite3.connect(':memory:')
    cursor = connection.cursor()
    cursor.execute('DROP TABLE IF EXISTS time_table')
    cursor.execute('CREATE TABLE time_table (name text, time real)')
    connection.commit()

    # Load file into database
    with open('raw.json') as f:
        for line in f:
            row = json.loads(line)
            cursor.execute('INSERT INTO time_table VALUES (?,?)', (row['name'], row['time']))
            connection.commit()

    # Report: print the name, count, sum, and average
    cursor.execute('SELECT name, COUNT(time), SUM(time), AVG(time) FROM time_table GROUP BY name')
    print '%-10s %8s %8s %8s' % ('NAME', 'COUNT', 'SUM', 'AVERAGE')
    for row in cursor.fetchall():
        print '%-10s %8d %8.2f %8.2f' % row

    connection.close()

Output

NAME          COUNT      SUM  AVERAGE
machine1          1    12.64    12.64
machine2          1    12.64    12.64
machine3          4    50.77    12.69
machine4          3    38.03    12.68
machine5          5    63.45    12.69

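Discussion

  • In this solution, I create an in-memory SQLite3 database.
  • Since we are only interested in the name and time columns, the table contains only those two columns.
  • We get all the aggregate functions, such as SUM, COUNT, and AVG, for free just by using a database.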
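
The same table could also carry the value column if the statistics should be computed on it rather than on time. The following is a minimal sketch under that assumption: the table name machine_table and its value column are made up for illustration, and a few lines from raw.json are inlined so the snippet is self-contained. It also inserts all rows with executemany and commits once, instead of committing per row.

```python
import json
import sqlite3

# A few lines copied from raw.json in the question; in a real script,
# read them from the file instead.
lines = [
    '{"time": 12.66505, "name": "machine4", "value": 1.345}',
    '{"time": 12.67007, "name": "machine5", "value": 5.068}',
    '{"time": 12.67508, "name": "machine4", "value": 1.075}',
    '{"time": 12.6801, "name": "machine5", "value": 2.0868}',
    '{"time": 12.6851, "name": "machine4", "value": 0.0}',
    '{"time": 12.69512, "name": "machine5", "value": 13.13}',
]

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
# Hypothetical table with an extra value column
cursor.execute('CREATE TABLE machine_table (name text, time real, value real)')

# Insert every row in one call, then commit once
rows = [json.loads(line) for line in lines]
cursor.executemany(
    'INSERT INTO machine_table VALUES (?,?,?)',
    [(row['name'], row['time'], row['value']) for row in rows])
connection.commit()

# Aggregate on value instead of time; MIN and MAX come for free as well
cursor.execute(
    'SELECT name, COUNT(value), SUM(value), AVG(value), MIN(value), MAX(value) '
    'FROM machine_table GROUP BY name ORDER BY name')
report = cursor.fetchall()
for row in report:
    print('%-10s %5d %8.3f %8.3f %8.3f %8.3f' % row)

connection.close()
```

Each entry of report is a tuple of (name, count, sum, average, minimum, maximum) for one machine.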

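Addition to the first solution

To answer the follow-up question: given machine5, how can I get the last value? I take this to mean that you want to filter the data down to the rows for machine5, sort them by time, and pick the last row. For the first solution, append the following block of code and run it: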

# Filter data: prints all rows with 'machine5'
print '\nFilter by machine5'
machine5 = [row for row in data if row['name'] == 'machine5']
machine5 = sorted(machine5, key=lambda row: row['time'])  # sort on the full float; int() would truncate every time to 12
pprint(machine5)

# Get the last instance
print '\nLast instance of machine5:'
latest_row = machine5[-1]
pprint(latest_row)

Don't forget to add the following at the top of the script:

from pprint import pprint

Output

Filter by machine5
[{u'name': u'machine5', u'time': 12.67007, u'value': 5.068},
 {u'name': u'machine5', u'time': 12.6801, u'value': 2.0868},
 {u'name': u'machine5', u'time': 12.6901, u'value': 12.633},
 {u'name': u'machine5', u'time': 12.69512, u'value': 13.13},
 {u'name': u'machine5', u'time': 12.71517, u'value': 131.633}]

Last instance of machine5:
{u'name': u'machine5', u'time': 12.71517, u'value': 131.633}

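Discussion

If you do not want to sort the rows by time, remove the sorted() line; the rows will then stay in the order in which they appear in the file.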
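
If only the latest row is needed, sorting the whole list is not required; max() with a key function picks it directly. A small sketch, with a few rows from the question inlined (deliberately out of order) in place of the data list:

```python
# Rows as they would come out of json.loads(); inlined here so the
# snippet runs on its own, and shuffled to show that order does not matter.
rows = [
    {'time': 12.67007, 'name': 'machine5', 'value': 5.068},
    {'time': 12.70013, 'name': 'machine3', 'value': False},
    {'time': 12.71517, 'name': 'machine5', 'value': 131.633},
    {'time': 12.6901, 'name': 'machine5', 'value': 12.633},
]

# Keep only the rows for machine5
machine5 = [row for row in rows if row['name'] == 'machine5']

# Pick the row with the largest time without sorting
latest_row = max(machine5, key=lambda row: row['time'])
print('latest: time={} value={}'.format(latest_row['time'], latest_row['value']))
```

Keep in mind that max() raises ValueError on an empty list, so check that the filter matched something if the machine name might not be present.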

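Answer 1: (score: 1)

Make each row an instance of a class (not strictly necessary, but nice), then either overload __cmp__ or supply a comparison function, and use sort.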

class MachineInfo:

    def __init__(self, info_time, name, value):
        self.info_time = info_time
        self.name = name
        self.value = value

def cmp_machines(a, b):
    return cmp(a.name, b.name)

sort also accepts an optional comparison function (in Python 2; the cmp argument was removed in Python 3):

info = [... fill this with MachineInfo instances here ...]

# then call 
info = sorted(info, cmp_machines)

# or to sort in place
info.sort(cmp_machines)

# alternatively add a  __cmp__ method to MachineInfo and that will get used by default

There are better ways to do this (see https://wiki.python.org/moin/HowTo/Sorting), but keeping it simple and clear is good.
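
One of those other ways is to sort with a key function instead of a comparison function. This works in both Python 2 and Python 3, whereas cmp and __cmp__ are gone in Python 3. The sketch below reuses the MachineInfo class from above with a few rows from the question; the __repr__ method and the variable names by_name and by_name_then_time are additions for illustration only.

```python
from operator import attrgetter


class MachineInfo(object):

    def __init__(self, info_time, name, value):
        self.info_time = info_time
        self.name = name
        self.value = value

    def __repr__(self):
        # Added only so that printed lists are readable
        return 'MachineInfo({!r}, {!r}, {!r})'.format(
            self.info_time, self.name, self.value)


info = [
    MachineInfo(12.67508, 'machine4', 1.075),
    MachineInfo(12.640, 'machine1', 24.0),
    MachineInfo(12.67007, 'machine5', 5.068),
    MachineInfo(12.66505, 'machine4', 1.345),
]

# Sort by name; the sort is stable, so ties keep their original order
by_name = sorted(info, key=attrgetter('name'))

# Sort by name first, then by time within each name
by_name_then_time = sorted(info, key=attrgetter('name', 'info_time'))

# Or sort in place
info.sort(key=attrgetter('name'))

print(by_name_then_time)
```

If an existing comparison function has to be kept, functools.cmp_to_key can wrap it so that it can be passed as key.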