I have a large JSON file with this structure:
[
{
"sniffer_serial":"7c9ebd9448a0",
"serial":"086bd7c39c8c",
"temp":31.36,
"x":-0.484375,
"y":-0.0078125,
"z":-0.859375,
"rssi":-70,
"id":33069,
"date":"2021-07-14 15:45:54.411"
},
{
"sniffer_serial":"7c9ebd945194",
"serial":"086bd7c39c8c",
"temp":31.36,
"x":-0.484375,
"y":-0.0078125,
"z":-0.859375,
"rssi":-70,
"id":33069,
"date":"2021-07-14 15:45:54.414"
},
{
"sniffer_serial":"7c9ebd9448a0",
"serial":"086bd7c39c8c",
"temp":31.36,
"x":-0.484375,
"y":-0.0078125,
"z":-0.859375,
"rssi":-70,
"id":33069,
"date":"2021-07-14 15:45:54.536"
},
{
"sniffer_serial":"7c9ebd945194",
"serial":"086bd7c39c8c",
"temp":31.36,
"x":-0.484375,
"y":-0.0078125,
"z":-0.859375,
"rssi":-70,
"id":33069,
"date":"2021-07-14 15:45:54.539"
},
{
"sniffer_serial":"7c9ebd9448a0",
"serial":"086bd7c39c8c",
"temp":31.36,
"x":-0.484375,
"y":-0.0078125,
"z":-0.859375,
"rssi":-70,
"id":33069,
"date":"2021-07-14 15:45:54.661"
},
{
"sniffer_serial":"7c9ebd945194",
"serial":"086bd7c39c8c",
"temp":31.36,
"x":-0.484375,
"y":-0.0078125,
"z":-0.859375,
"rssi":-70,
"id":33069,
"date":"2021-07-14 15:45:54.664"
},
{
"date": "2021-07-13/10:28:00.930",
"id": 21661,
"rssi": -81,
"serial": "086bd7c39baf",
"sniffer_serial": "7c9ebd9448a0",
"temp": 36.21,
"x": -0.4453125,
"y": -0.1328125,
"z": -0.8671875
},
{
"date": "2021-07-13/10:28:01.680",
"id": 21663,
"rssi": -80,
"serial": "086bd7c39baf",
"sniffer_serial": "7c9ebd9448a0",
"temp": 36.21,
"x": -0.4140625,
"y": -0.1171875,
"z": -0.8515625
},
{
"date": "2021-07-13/10:28:02.60",
"id": 21664,
"rssi": -88,
"serial": "086bd7c39baf",
"sniffer_serial": "7c9ebd9450cc",
"temp": 36.21,
"x": -0.4375,
"y": -0.0546875,
"z": -0.8515625
}
]
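As you can see, some values are duplicated. The id 33069 appears 6 times, that is, 3 times for each sniffer_serial, and only the timestamps differ between the entries.
I want to keep the first three occurrences of each id and drop the other three.
In this example the repeated pattern occurs only once, but it can occur many times throughout the file.
So far I have only worked out how to keep the first occurrence of each id and append it to a list.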
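import json
from itertools import groupby
loader = json.loads(myJsonFile)  # myJsonFile holds the JSON text
data = []
# groupby only merges adjacent items, so equal ids must end up next to each other after sorting
for key, items in groupby(sorted(loader, key=lambda x: (x['serial'], x['date'])), key=lambda x: x['id']):
    data.append(next(items))  # keep only the first record of each group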
Answer 0: (score: 1)
You can use a defaultdict:
>>> from collections import defaultdict
>>> data = defaultdict(list)
>>> for x in loader:
...     if len(data[x['id']]) < 3:
...         data[x['id']].append(x)
...
>>> data
defaultdict(<class 'list'>, {33069: [{'sniffer_serial': '7c9ebd9448a0', 'serial': '086bd7c39c8c', 'temp': 31.36, 'x': -0.484375, 'y': -0.0078125, 'z': -0.859375, 'rssi': -70, 'id': 33069, 'date': '2021-07-14 15:45:54.411'}, {'sniffer_serial': '7c9ebd945194', 'serial': '086bd7c39c8c', 'temp': 31.36, 'x': -0.484375, 'y': -0.0078125, 'z': -0.859375, 'rssi': -70, 'id': 33069, 'date': '2021-07-14 15:45:54.414'}, {'sniffer_serial': '7c9ebd9448a0', 'serial': '086bd7c39c8c', 'temp': 31.36, 'x': -0.484375, 'y': -0.0078125, 'z': -0.859375, 'rssi': -70, 'id': 33069, 'date': '2021-07-14 15:45:54.536'}], 21661: [{'date': '2021-07-13/10:28:00.930', 'id': 21661, 'rssi': -81, 'serial': '086bd7c39baf', 'sniffer_serial': '7c9ebd9448a0', 'temp': 36.21, 'x': -0.4453125, 'y': -0.1328125, 'z': -0.8671875}], 21663: [{'date': '2021-07-13/10:28:01.680', 'id': 21663, 'rssi': -80, 'serial': '086bd7c39baf', 'sniffer_serial': '7c9ebd9448a0', 'temp': 36.21, 'x': -0.4140625, 'y': -0.1171875, 'z': -0.8515625}], 21664: [{'date': '2021-07-13/10:28:02.60', 'id': 21664, 'rssi': -88, 'serial': '086bd7c39baf', 'sniffer_serial': '7c9ebd9450cc', 'temp': 36.21, 'x': -0.4375, 'y': -0.0546875, 'z': -0.8515625}]})
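The result here is a dict keyed by id, while the question asks for a flat list of records. Below is a minimal sketch of one way to flatten it, assuming Python 3.7+ (where dicts preserve insertion order). The `loader` list is a shortened, hypothetical stand-in for the parsed file, and the name `flat` is only illustrative.

```python
import json
from collections import defaultdict
from itertools import chain

# Hypothetical, shortened stand-in for the parsed file (only two fields kept).
loader = [
    {"id": 33069, "date": "2021-07-14 15:45:54.411"},
    {"id": 33069, "date": "2021-07-14 15:45:54.414"},
    {"id": 33069, "date": "2021-07-14 15:45:54.536"},
    {"id": 33069, "date": "2021-07-14 15:45:54.539"},
    {"id": 21661, "date": "2021-07-13/10:28:00.930"},
]

# Same grouping as in the answer: at most three records per id.
data = defaultdict(list)
for x in loader:
    if len(data[x["id"]]) < 3:
        data[x["id"]].append(x)

# Concatenate the per-id lists; groups come out in order of first appearance.
flat = list(chain.from_iterable(data.values()))
print(json.dumps(flat, indent=2))
```

If the records of different ids are interleaved in the file, this flattening groups them by id rather than keeping the original interleaving, which may or may not be what you want.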
Answer 1: (score: 1)
Keeping a dictionary of counts might help. Here is the solution I tried.
data = []
count_book = {}
for i in loader:
    if i['id'] not in count_book:
        count_book[i['id']] = 0
    if count_book[i['id']] < 3:
        data.append(i)
        count_book[i['id']] += 1
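The same counting idea can be wrapped in a small reusable function. This is only a sketch: the function name `keep_first_n` and its parameters `n` and `key` are illustrative and not from the original answer, and `collections.Counter` is swapped in for the plain dict because it returns 0 for keys it has not seen yet.

```python
from collections import Counter


def keep_first_n(records, n=3, key="id"):
    """Return records in their original order, keeping at most n per key value."""
    seen = Counter()  # missing keys count as 0, so no explicit initialisation
    kept = []
    for rec in records:
        if seen[rec[key]] < n:
            kept.append(rec)
            seen[rec[key]] += 1
    return kept


# Hypothetical sample: five records with id 1, then one record with id 2.
records = [{"id": 1, "v": i} for i in range(5)] + [{"id": 2, "v": 9}]
result = keep_first_n(records)
print(result)
```

Unlike grouping into a dict of lists, this keeps the surviving records in exactly the order they had in the input.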
Answer 2: (score: 1)
You can use pandas to read the JSON file, group by id, and then keep only the first 3 rows of each group like this:
import pandas as pd
df = pd.read_json('...')  # path to the JSON file
# head(3) keeps the first 3 rows of each id group, in their original order
df = df.groupby('id').head(3)
df.to_json("...", orient='records')  # save the result as JSON