I have a JSON data file, raw.json, with one JSON object per line:
{"time": 12.640, "name": "machine1", "value": 24.0}
{"time": 12.645, "name": "machine2", "value": 0.0}
{"time": 12.65002, "name": "machine3", "value": true}
{"time": 12.66505, "name": "machine4", "value": 1.345}
{"time": 12.67007, "name": "machine5", "value": 5.068}
{"time": 12.67508, "name": "machine4", "value": 1.075}
{"time": 12.6801, "name": "machine5", "value": 2.0868}
{"time": 12.6851, "name": "machine4", "value": 0.0}
{"time": 12.6901, "name": "machine5", "value": 12.633}
{"time": 12.69512, "name": "machine5", "value": 13.13}
{"time": 12.70013, "name": "machine3", "value": false}
{"time": 12.70515, "name": "machine3", "value": false}
{"time": 12.71016, "name": "machine3", "value": false}
{"time": 12.71517, "name": "machine5", "value": 131.633}
In my Python script, I can read it line by line and build a list:
import json

data = []
# The with block closes the file automatically, so no explicit close() is needed
with open('raw.json') as f:
    for line in f:
        data.append(json.loads(line))

for idx, row in enumerate(data):
    data_list = idx + 1, row['time'], row['name'], row['value']
    print(data_list)
Output:
(1, 12.64, u'machine1', 24.0)
(2, 12.645, u'machine2', 0.0)
(3, 12.65002, u'machine3', True)
(4, 12.66505, u'machine4', 1.345)
(5, 12.67007, u'machine5', 5.068)
(6, 12.67508, u'machine4', 1.075)
(7, 12.6801, u'machine5', 2.0868)
(8, 12.6851, u'machine4', 0.0)
(9, 12.6901, u'machine5', 12.633)
(10, 12.69512, u'machine5', 13.13)
(11, 12.70013, u'machine3', False)
(12, 12.70515, u'machine3', False)
(13, 12.71016, u'machine3', False)
(14, 12.71517, u'machine5', 131.633)
I want to sort this data into separate lists (arrays), one per machine, that I can work with, e.g.:
machine1 = [12.640, 24.0];
machine2 = [12.645, 0.0];
machine3 = [
    12.65002, true
    12.70013, false
    12.70515, false
    12.71016, false
];
machine4 = [
    12.66505, 1.345
    12.67508, 1.075
    12.6851, 0.0
];
and so on. I would also like to search these tuples or lists directly to generate metadata, such as the sum / average for machine1, machine2, etc.
Sum_Machine1 = 24;
Sum_Machine2 = 0;....
Answer 0 (score: 2)
Here is how I would solve the problem:
import json
import collections

if __name__ == '__main__':
    # Load file into data
    with open('raw.json') as f:
        data = [json.loads(line) for line in f]

    # Calculate count and total
    time_total = collections.defaultdict(float)
    time_count = collections.defaultdict(int)
    for row in data:
        time_count[row['name']] += 1
        time_total[row['name']] += row['time']

    # Calculate average
    time_average = {}
    for name in time_count:
        time_average[name] = time_total[name] / time_count[name]

    # Report
    for name in sorted(time_count):
        print('{:<10} {:2} {:8.2f} {:8.2f}'.format(
            name,
            time_count[name],
            time_total[name],
            time_average[name]))
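The same defaultdict pattern can also build the per-machine lists the question asks for, and summarize the value field rather than time. The following is only a minimal sketch: the names RAW_LINES, group_by_machine, and summarize are made up for illustration, and the sample lines are a subset of the question's data embedded inline so the example runs without raw.json.

```python
import json
from collections import defaultdict

# Sample lines copied from raw.json in the question (a subset, for illustration).
RAW_LINES = [
    '{"time": 12.640, "name": "machine1", "value": 24.0}',
    '{"time": 12.65002, "name": "machine3", "value": true}',
    '{"time": 12.66505, "name": "machine4", "value": 1.345}',
    '{"time": 12.67508, "name": "machine4", "value": 1.075}',
    '{"time": 12.6851, "name": "machine4", "value": 0.0}',
    '{"time": 12.70013, "name": "machine3", "value": false}',
]


def group_by_machine(lines):
    """Return {name: [(time, value), ...]} with rows kept in input order."""
    groups = defaultdict(list)
    for line in lines:
        row = json.loads(line)
        groups[row['name']].append((row['time'], row['value']))
    return groups


def summarize(groups):
    """Return {name: (count, sum_of_values, average_value)}.

    Booleans count as 1 (True) and 0 (False) when summed.
    """
    summary = {}
    for name, pairs in groups.items():
        values = [value for _, value in pairs]
        total = sum(values)
        summary[name] = (len(values), total, total / len(values))
    return summary


machines = group_by_machine(RAW_LINES)
stats = summarize(machines)
for name in sorted(stats):
    print(name, machines[name], stats[name])
```

To run this against the real file, pass the open file object instead of RAW_LINES, since iterating a file also yields one line at a time.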
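data is a list of dicts, each containing name, time, and value.
defaultdict is a convenient way to keep tallies: if a key does not exist yet, it is created with a default value of 0 (for int) or 0.0 (for float). It is worth reading up on.
Here is a different approach: since your data looks like a table, why not use a database to process it? The advantage of this approach is that you do not have to do the calculations yourself.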
import json
import sqlite3

if __name__ == '__main__':
    # Create an in-memory database for calculation
    connection = sqlite3.connect(':memory:')
    cursor = connection.cursor()
    cursor.execute('DROP TABLE IF EXISTS time_table')
    cursor.execute('CREATE TABLE time_table (name text, time real)')
    connection.commit()

    # Load file into database
    with open('raw.json') as f:
        for line in f:
            row = json.loads(line)
            cursor.execute('INSERT INTO time_table VALUES (?,?)', (row['name'], row['time']))
    connection.commit()

    # Report: print the name, count, sum, and average
    cursor.execute('SELECT name, COUNT(time), SUM(time), AVG(time) FROM time_table GROUP BY name')
    print('%-10s %8s %8s %8s' % ('NAME', 'COUNT', 'SUM', 'AVERAGE'))
    for row in cursor.fetchall():
        print('%-10s %8d %8.2f %8.2f' % row)
    connection.close()
Output:
NAME          COUNT      SUM  AVERAGE
machine1          1    12.64    12.64
machine2          1    12.64    12.64
machine3          4    50.77    12.69
machine4          3    38.03    12.68
machine5          5    63.45    12.69
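If all you need is SUM, COUNT, and AVG, just use the database.
To answer the follow-up question: given machine5, how do I get the last value? I take this to mean that you want to filter the data down to the rows for machine5, sort them by time, and pick the last row. For the first solution, append the following code block and run it: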
# Filter data: prints all rows with 'machine5'
print('\nFilter by machine5')
machine5 = [row for row in data if row['name'] == 'machine5']
# Sort on the float timestamp itself; int() would truncate every time to 12
machine5 = sorted(machine5, key=lambda row: row['time'])
pprint(machine5)

# Get the last instance
print('\nLast instance of machine5:')
latest_row = machine5[-1]
pprint(latest_row)
Don't forget to add the following at the beginning of the script:
from pprint import pprint
Output:
Filter by machine5
[{u'name': u'machine5', u'time': 12.67007, u'value': 5.068},
{u'name': u'machine5', u'time': 12.6801, u'value': 2.0868},
{u'name': u'machine5', u'time': 12.6901, u'value': 12.633},
{u'name': u'machine5', u'time': 12.69512, u'value': 13.13},
{u'name': u'machine5', u'time': 12.71517, u'value': 131.633}]
Last instance of machine5:
{u'name': u'machine5', u'time': 12.71517, u'value': 131.633}
If you do not want to sort the rows by time, remove the sorted() line, which leaves the rows in their original file order.
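The "last value for machine5" lookup can also be expressed in the database approach. This is a rough sketch rather than part of the original answer: the table name readings, the helper latest_row, and the extra value column are assumptions made for illustration, and a few rows from the question are embedded inline so it runs on its own.

```python
import sqlite3

# (time, name, value) rows copied from the question's data, deliberately out of order.
ROWS = [
    (12.67007, 'machine5', 5.068),
    (12.67508, 'machine4', 1.075),
    (12.6801, 'machine5', 2.0868),
    (12.71517, 'machine5', 131.633),
    (12.6901, 'machine5', 12.633),
]


def latest_row(connection, name):
    """Return the (time, name, value) row with the largest time for `name`, or None."""
    cursor = connection.execute(
        'SELECT time, name, value FROM readings '
        'WHERE name = ? ORDER BY time DESC LIMIT 1',
        (name,))
    return cursor.fetchone()


connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE readings (time real, name text, value real)')
connection.executemany('INSERT INTO readings VALUES (?,?,?)', ROWS)
connection.commit()

last = latest_row(connection, 'machine5')
missing = latest_row(connection, 'machine9')  # no such machine, so None
print(last)
connection.close()
```

Letting the query do the ordering means the rows do not have to be inserted in time order, and a machine with no rows simply yields None instead of raising an IndexError.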
Answer 1 (score: 1)
Make each row an instance of a class (not strictly necessary, but nice), define a comparison function, and use sort:
class MachineInfo:
    def __init__(self, info_time, name, value):
        self.info_time = info_time
        self.name = name
        self.value = value

def cmp_machines(a, b):
    return cmp(a.name, b.name)
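sort also takes an optional comparison function. Note that this is Python 2 behavior: Python 3 removed both cmp() and the comparison-function argument in favor of key=.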
info = [... fill this with MachineInfo instances here ...]
# then call
info = sorted(info, cmp_machines)
# or to sort in place
info.sort(cmp_machines)
# alternatively add a __cmp__ method to MachineInfo and that will get used by default
There are better ways to do this (see https://wiki.python.org/moin/HowTo/Sorting), but keeping it simple and clear is good.
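For readers on Python 3, where cmp-style sorting is no longer available, one of the approaches described on that wiki page is a key function. The sketch below assumes the same MachineInfo class as above (with a __repr__ added only to make the printout readable) and a handful of rows from the question; it is an illustration, not the answer's original code.

```python
from operator import attrgetter


class MachineInfo:
    """One reading; mirrors the class in the answer above."""

    def __init__(self, info_time, name, value):
        self.info_time = info_time
        self.name = name
        self.value = value

    def __repr__(self):
        return 'MachineInfo({!r}, {!r}, {!r})'.format(
            self.info_time, self.name, self.value)


info = [
    MachineInfo(12.66505, 'machine4', 1.345),
    MachineInfo(12.640, 'machine1', 24.0),
    MachineInfo(12.67508, 'machine4', 1.075),
    MachineInfo(12.65002, 'machine3', True),
]

# Sort by name first, then by time within each name; returns a new list.
by_name = sorted(info, key=attrgetter('name', 'info_time'))

# Sort the original list in place by time only.
info.sort(key=attrgetter('info_time'))

print(by_name)
```

A key function is called once per element, whereas a comparison function is called for every pair the sort compares, so this form is usually both shorter and faster.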