Filter occurrences in an array of JSON objects

Time: 2021-07-14 19:38:02

Tags: python json python-3.x

I have a large JSON file with this structure:

[
    {
       "sniffer_serial":"7c9ebd9448a0",
       "serial":"086bd7c39c8c",
       "temp":31.36,
       "x":-0.484375,
       "y":-0.0078125,
       "z":-0.859375,
       "rssi":-70,
       "id":33069,
       "date":"2021-07-14 15:45:54.411"
    },
    {
        "sniffer_serial":"7c9ebd945194",
        "serial":"086bd7c39c8c",
        "temp":31.36,
        "x":-0.484375,
        "y":-0.0078125,
        "z":-0.859375,
        "rssi":-70,
        "id":33069,
        "date":"2021-07-14 15:45:54.414"
     },
     {
        "sniffer_serial":"7c9ebd9448a0",
        "serial":"086bd7c39c8c",
        "temp":31.36,
        "x":-0.484375,
        "y":-0.0078125,
        "z":-0.859375,
        "rssi":-70,
        "id":33069,
        "date":"2021-07-14 15:45:54.536"
     },
     {
        "sniffer_serial":"7c9ebd945194",
        "serial":"086bd7c39c8c",
        "temp":31.36,
        "x":-0.484375,
        "y":-0.0078125,
        "z":-0.859375,
        "rssi":-70,
        "id":33069,
        "date":"2021-07-14 15:45:54.539"
     },
     {
        "sniffer_serial":"7c9ebd9448a0",
        "serial":"086bd7c39c8c",
        "temp":31.36,
        "x":-0.484375,
        "y":-0.0078125,
        "z":-0.859375,
        "rssi":-70,
        "id":33069,
        "date":"2021-07-14 15:45:54.661"
     },
     {
        "sniffer_serial":"7c9ebd945194",
        "serial":"086bd7c39c8c",
        "temp":31.36,
        "x":-0.484375,
        "y":-0.0078125,
        "z":-0.859375,
        "rssi":-70,
        "id":33069,
        "date":"2021-07-14 15:45:54.664"
     },
     {
        "date": "2021-07-13/10:28:00.930",
        "id": 21661,
        "rssi": -81,
        "serial": "086bd7c39baf",
        "sniffer_serial": "7c9ebd9448a0",
        "temp": 36.21,
        "x": -0.4453125,
        "y": -0.1328125,
        "z": -0.8671875
    },
    {
        "date": "2021-07-13/10:28:01.680",
        "id": 21663,
        "rssi": -80,
        "serial": "086bd7c39baf",
        "sniffer_serial": "7c9ebd9448a0",
        "temp": 36.21,
        "x": -0.4140625,
        "y": -0.1171875,
        "z": -0.8515625
    },
    {
        "date": "2021-07-13/10:28:02.60",
        "id": 21664,
        "rssi": -88,
        "serial": "086bd7c39baf",
        "sniffer_serial": "7c9ebd9450cc",
        "temp": 36.21,
        "x": -0.4375,
        "y": -0.0546875,
        "z": -0.8515625
    }
 ]

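As you can see, I have some repeated values. The id 33069 appears 6 times, i.e. 3 times for each sniffer_serial, and only the timestamps differ between the records.

What I want is to keep the first three occurrences of the same id and drop the other three.

In this example the repetition pattern appears only once, but it can appear many times throughout the file.

What I have so far keeps only the first occurrence of each id and appends it to a list.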

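import json
from itertools import groupby

loader = json.loads(myJsonFile)
data = []
# groupby only merges adjacent items, so the sort order decides which records end up in one group
for key, items in groupby(sorted(loader, key=lambda x: (x['serial'], x['date'])), key=lambda x: x['id']):
    data.append(next(items))  # keep only the first record of each group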
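
For reference, the same groupby idea can keep up to three records per group by taking a slice of each group instead of calling next(items) once. The following is only a minimal sketch under stated assumptions: the helper name first_n_per_id and the miniature sample list are hypothetical, and the records are sorted by id alone so that equal ids are adjacent (Python's sort is stable, so records with the same id keep their file order).

```python
from itertools import groupby, islice


def first_n_per_id(records, n=3):
    """Keep at most n records per id (hypothetical helper, not from the question)."""
    # sorted() is stable: records sharing an id stay in their original file order
    ordered = sorted(records, key=lambda r: r["id"])
    kept = []
    for _, group in groupby(ordered, key=lambda r: r["id"]):
        # islice takes the first n items of the group and ignores the rest
        kept.extend(islice(group, n))
    return kept


# Hypothetical miniature of the data in the question: six records for one id, one for another
sample = [{"id": 33069, "date": f"t{i}"} for i in range(6)] + [{"id": 21661, "date": "t6"}]
result = first_n_per_id(sample)
```

Note that sorting reorders the output by id, so the result no longer follows the order of the file.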

3 answers:

Answer 0: (score: 1)

You can use a defaultdict:

>>> from collections import defaultdict 
>>> data = defaultdict(list)
>>> for x in loader:
...   if len(data[x['id']]) < 3:
...     data[x['id']].append(x)
...
>>> data
defaultdict(<class 'list'>, {33069: [{'sniffer_serial': '7c9ebd9448a0', 'serial': '086bd7c39c8c', 'temp': 31.36, 'x': -0.484375, 'y': -0.0078125, 'z': -0.859375, 'rssi': -70, 'id': 33069, 'date': '2021-07-14 15:45:54.411'}, {'sniffer_serial': '7c9ebd945194', 'serial': '086bd7c39c8c', 'temp': 31.36, 'x': -0.484375, 'y': -0.0078125, 'z': -0.859375, 'rssi': -70, 'id': 33069, 'date': '2021-07-14 15:45:54.414'}, {'sniffer_serial': '7c9ebd9448a0', 'serial': '086bd7c39c8c', 'temp': 31.36, 'x': -0.484375, 'y': -0.0078125, 'z': -0.859375, 'rssi': -70, 'id': 33069, 'date': '2021-07-14 15:45:54.536'}], 21661: [{'date': '2021-07-13/10:28:00.930', 'id': 21661, 'rssi': -81, 'serial': '086bd7c39baf', 'sniffer_serial': '7c9ebd9448a0', 'temp': 36.21, 'x': -0.4453125, 'y': -0.1328125, 'z': -0.8671875}], 21663: [{'date': '2021-07-13/10:28:01.680', 'id': 21663, 'rssi': -80, 'serial': '086bd7c39baf', 'sniffer_serial': '7c9ebd9448a0', 'temp': 36.21, 'x': -0.4140625, 'y': -0.1171875, 'z': -0.8515625}], 21664: [{'date': '2021-07-13/10:28:02.60', 'id': 21664, 'rssi': -88, 'serial': '086bd7c39baf', 'sniffer_serial': '7c9ebd9450cc', 'temp': 36.21, 'x': -0.4375, 'y': -0.0546875, 'z': -0.8515625}]})
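
The defaultdict groups the records by id, whereas the question asks for a list. One possible way to get back a flat list that can be written out as JSON is to chain the groups together. This is a sketch only: the function name keep_first_n and the shortened sample records are assumptions made for illustration.

```python
import json
from collections import defaultdict
from itertools import chain


def keep_first_n(records, n=3):
    """Group records by id, storing at most n per id (hypothetical helper)."""
    groups = defaultdict(list)
    for rec in records:
        if len(groups[rec["id"]]) < n:
            groups[rec["id"]].append(rec)
    return groups


# Shortened, hypothetical records shaped like the ones in the question
sample = [
    {"id": 33069, "sniffer_serial": "7c9ebd9448a0"},
    {"id": 33069, "sniffer_serial": "7c9ebd945194"},
    {"id": 33069, "sniffer_serial": "7c9ebd9448a0"},
    {"id": 33069, "sniffer_serial": "7c9ebd945194"},
    {"id": 21661, "sniffer_serial": "7c9ebd9448a0"},
]

groups = keep_first_n(sample)
# dicts preserve insertion order (Python 3.7+), so ids come out in order of first appearance
flat = list(chain.from_iterable(groups.values()))
as_json = json.dumps(flat, indent=4)
```

The flattened list is ordered group by group, so records of different ids that were interleaved in the file end up next to their own id.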

Answer 1: (score: 1)

Keeping a dictionary of counts may help. Here is the solution I tried.

data = []
count_book = {}
for i in loader:
    if i['id'] not in count_book:
        count_book[i['id']] = 0
    if count_book[i['id']] < 3:
        data.append(i)
        count_book[i['id']] += 1
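
The counting loop above could also be wrapped in a reusable function built on collections.Counter, which returns 0 for keys it has not seen and so removes the need for the explicit initialisation. A minimal sketch, assuming the names limit_occurrences, key, and limit (none of which come from the answer) and a made-up sample list:

```python
from collections import Counter


def limit_occurrences(records, key="id", limit=3):
    """Return records in their original order, keeping at most `limit` per key value."""
    seen = Counter()  # missing keys count as 0
    kept = []
    for rec in records:
        if seen[rec[key]] < limit:
            kept.append(rec)
            seen[rec[key]] += 1
    return kept


# Hypothetical interleaved sample: id 1 appears four times, id 2 twice
sample = [{"id": i} for i in (1, 2, 1, 1, 1, 2)]
filtered = limit_occurrences(sample)
```

Because the records are visited once and appended as they come, the output keeps the order of the file, and the count is global: if an id shows up again much later in the file after its limit is reached, those later records are dropped as well.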

Answer 2: (score: 1)

You can use pandas to read the JSON file, group by id, and then keep only the first 3 rows of each group like this:

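import pandas as pd
df = pd.read_json('...') # path to the JSON file
# head(3) keeps the first 3 rows of each id group in their original order;
# unlike nth((0,1,2)).reset_index(), its result does not depend on the pandas version
df = df.groupby('id').head(3)

df.to_json("...", orient='records') # save the result as JSON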