Loading a multi-nested dict / JSON into pandas

Time: 2019-01-31 07:40:23

Tags: python json pandas dictionary

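I'm trying to load a confusing multi-nested JSON into pandas. I'm already using json_normalize, but working out how to join 2 similar nested dicts, and how to unpack their sub-dicts and lists, has me stumped. My knowledge of pandas is limited, but I assume I can take advantage of its performance benefits once I get this down.

I have 2 dicts containing war data, one loaded from a JSON API response and one from a database. I'm trying to compare the 2 for new attacks and defenses.

Example war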

{
  "state": "active",
  "team_size": 20,
  "teams": {
    "id": "12345679",
    "name": "Good Guys",
    "level": 10,
    "attacks": 4,
    "destruction_percentage": 22.6,
    "members": [
      {
        "id": "1",
        "name": "John",
        "level": 12
      },
      {
        "id": "2",
        "name": "Tom",
        "level": 11,
        "attacks": [
          {
            "attackerTag": "2",
            "defenderTag": "4",
            "damage": 64,
            "order": 7
          }
        ]
      }
    ]
  },
  "opponent": {
    "id": "987654321",
    "name": "Bad Guys",
    "level": 17,
    "attacks": 5,
    "damage": 20.95,
    "members": [
      {
        "id": "3",
        "name": "Betty",
        "level": 17,
        "attacks": [
          {
            "attacker_id": "3",
            "defender_id": "1",
            "damage": 70,
            "order": 1
          },
          {
            "attacker_id": "3",
            "defender_id": "7",
            "damage": 100,
            "order": 11
          }
        ],
        "opponentAttacks": 0,
        "some_useless_data": "Want to ignore, this doesn't show in every record"
      },
      {
        "id": "4",
        "name": "Fred",
        "level": 9,
        "attacks": [
          {
            "attacker_id": "4",
            "defender_id": "9",
            "damage": 70,
            "order": 4
          }
        ],
        "opponentAttacks": 0
      }
    ]
  }
}

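Now, in terms of performance, I assume pandas will be my best option, rather than zipping the two together, iterating over each member, and comparing.

So, let me lay out my attempts at getting the dataframe flat and easy to traverse. Ideally, I'd assume the layout below. I just want both teams in a single df containing all the members. We can omit the state and team_size keys and focus only on getting each member with their attacks and respective team_id.

Example df (expected result)

id  name   level  attacks         member.team_id  ...
1   John   12     NaN             "123456789"
2   Tom    11     [{...}]         "123456789"
3   Betty  17     [{...}, {...}]  "987654321"
4   Fred   9      [{...}]         "987654321"

That's the basic gist of what I'd like for a df. That way I can take both dataframes and compare them for new attacks.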

Note: I just pop() state and team_size from the dictionary before starting my attempts, since all I want is the members, and the team is pretty much embedded in them.

I've had no luck with the following, and I know it isn't the right approach, since it works backwards up the dictionary tree.

old_df = json_normalize(war,
                        'members',
                        ['id', 'name', 'level', 'attacks'],
                        record_prefix='member')

#Traceback (most recent call last):
#  File "test.py", line 83, in <module>
#    new_parse(old_war, new_war)
#  File "test.py", line 79, in new_parse
#    record_prefix='member')
#  File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 262, in json_normalize
#    _recursive_extract(data, record_path, {}, level=0)
#  File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 238, in _recursive_extract
#    recs = _pull_field(obj, path[0])
#  File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/io/json/normalize.py", line 185, in _pull_field
#    result = result[spec]
#KeyError: 'members'

I thought I could use something like the following, but that doesn't work either.

df = pd.DataFrame.from_dict(old, orient='index')
df.droplevel('members')

#Traceback (most recent call last):
#  File "test.py", line 106, in <module>
#    new_parse(old_war, new_war)
#  File "test.py", line 87, in new_parse
#    df.droplevel('members')
#  File "/home/jbacher/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 4376, in __getattr__
#    return object.__getattribute__(self, name)
#AttributeError: 'DataFrame' object has no attribute 'droplevel'

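I appreciate any guidance! Hopefully I've put in enough effort to make my expected result clear; if not, please let me know!

Edit: To be fair, I do know how to do this by just looping over the dictionaries and building a new list of members with the appropriate data, but I feel that is much less efficient than using pandas. I'm doing this for millions of wars in a threaded application, and every bit of performance I can get out of it is a benefit to me and the application. Thanks again!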
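
For reference, the loop-based baseline described in the edit might look roughly like the sketch below. The `war` dict is a hypothetical, trimmed-down stand-in for the sample above, and `flatten_members` is an illustrative helper name, not something from the original code.

```python
# Hypothetical, trimmed-down war dict that follows the shape of the sample.
war = {
    "teams": {
        "id": "12345679",
        "members": [
            {"id": "1", "name": "John", "level": 12},
            {"id": "2", "name": "Tom", "level": 11,
             "attacks": [{"attacker_id": "2", "defender_id": "4",
                          "damage": 64, "order": 7}]},
        ],
    },
    "opponent": {
        "id": "987654321",
        "members": [
            {"id": "3", "name": "Betty", "level": 17, "opponentAttacks": 0,
             "attacks": [{"attacker_id": "3", "defender_id": "1",
                          "damage": 70, "order": 1}]},
        ],
    },
}


def flatten_members(war):
    """Return one dict per member, tagged with the id of its team."""
    wanted = ("id", "name", "level", "attacks")
    rows = []
    for side in ("teams", "opponent"):
        team = war[side]
        for member in team["members"]:
            # Keys a member lacks (e.g. no attacks yet) become None;
            # unwanted keys such as opponentAttacks are skipped.
            row = {key: member.get(key) for key in wanted}
            row["member.team_id"] = team["id"]
            rows.append(row)
    return rows


rows = flatten_members(war)
```

Passing `rows` to `pd.DataFrame` would then give a frame with the layout shown in the expected result.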

2 answers:

Answer 0 (score: 1)

I believe you can use:

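import numpy as np
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; on older versions: from pandas.io.json import json_normalize

need = ['member.id', 'member.name', 'member.level', 'member.attacks','id']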
df1 = json_normalize(war['teams'],
                     'members',
                     ['id', 'name', 'level', 'attacks'], 
                     record_prefix='member.')[need]
#print (df1)

df2 = json_normalize(war['opponent'],
                     'members',
                     ['id', 'name', 'level', 'attacks'], 
                     record_prefix='member.')[need]
#print (df2)


df1.columns = np.where(df1.columns.str.startswith('member.'), 
                       df1.columns.str.split('.', n=1).str[1],
                       'member.' + df1.columns)
df2.columns = np.where(df2.columns.str.startswith('member.'), 
                       df2.columns.str.split('.', n=1).str[1],
                       'member.' + df2.columns)


df = pd.concat([df1, df2], sort=False, ignore_index=True)
print (df)
  id   name  level                                            attacks  \
0  1   John     12                                                NaN   
1  2    Tom     11  [{'attackerTag': '2', 'defenderTag': '4', 'dam...   
2  3  Betty     17  [{'attacker_id': '3', 'defender_id': '1', 'dam...   
3  4   Fred      9  [{'attacker_id': '4', 'defender_id': '9', 'dam...   

   member.id  
0   12345679  
1   12345679  
2  987654321  
3  987654321  
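
The attacks column above still holds a list of dicts per member. One possible way to unpack it into one row per attack is sketched below, assuming a pandas version that has `DataFrame.explode` (0.25 or later) and top-level `pd.json_normalize` (1.0 or later). The `members` list is hypothetical data shaped like the sample, with consistent `attacker_id` / `defender_id` keys.

```python
import pandas as pd

# Hypothetical member records shaped like the sample war (not real data).
members = [
    {"id": "1", "name": "John", "level": 12},
    {"id": "2", "name": "Tom", "level": 11,
     "attacks": [{"attacker_id": "2", "defender_id": "4",
                  "damage": 64, "order": 7}]},
    {"id": "3", "name": "Betty", "level": 17,
     "attacks": [{"attacker_id": "3", "defender_id": "1",
                  "damage": 70, "order": 1},
                 {"attacker_id": "3", "defender_id": "7",
                  "damage": 100, "order": 11}]},
]

df = pd.DataFrame(members)

# explode() repeats a member's row once per element of its attacks list.
# Members without attacks keep a single NaN row, which is dropped here.
exploded = (df.explode("attacks")
              .dropna(subset=["attacks"])
              .reset_index(drop=True))

# Expand each attack dict into its own columns and keep the member name beside them.
attack_cols = pd.json_normalize(exploded["attacks"].tolist())
attacks = pd.concat([exploded[["name"]], attack_cols], axis=1)
```

Members with no attacks (John here) simply drop out of `attacks`, which is usually what a per-attack comparison needs.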

Answer 1 (score: 1)

Try the following four-liner:

# 'opponent' is a top-level key of war in the sample, not nested under 'teams'
d = war['teams']['members'] + war['opponent']['members']
df = pd.DataFrame(d)
df = df[['id', 'name', 'level', 'attacks']]
df['member.team_id'] = [war['opponent']['id'] if i in war['opponent']['members'] else war['teams']['id'] for i in d]
print(df)

Output:

  id   name  level                                            attacks  \
0  1   John     12                                                NaN   
1  2    Tom     11  [{'attackerTag': '2', 'defenderTag': '4', 'dam...   
2  3  Betty     17  [{'attacker_id': '3', 'defender_id': '1', 'dam...   
3  4   Fred      9  [{'attacker_id': '4', 'defender_id': '9', 'dam...   

  member.team_id  
0       12345679  
1       12345679  
2      987654321  
3      987654321  
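
Once the old and new wars are each flattened to one row per attack, the comparison for new attacks mentioned in the question could be sketched as an anti-join with `DataFrame.merge` and `indicator=True`. The two frames below are hypothetical, and matching on every column is an assumption; a narrower key such as `order` may be enough if it is unique within a war.

```python
import pandas as pd

# Hypothetical per-attack frames: one from the database, one from the API.
old_attacks = pd.DataFrame([
    {"attacker_id": "3", "defender_id": "1", "damage": 70, "order": 1},
])
new_attacks = pd.DataFrame([
    {"attacker_id": "3", "defender_id": "1", "damage": 70, "order": 1},
    {"attacker_id": "4", "defender_id": "9", "damage": 70, "order": 4},
])

# A left merge with indicator=True adds a "_merge" column saying whether
# each row of new_attacks also exists in old_attacks.
merged = new_attacks.merge(
    old_attacks,
    how="left",
    on=list(new_attacks.columns),
    indicator=True,
)

# Rows marked "left_only" exist only in the new data, i.e. they are new attacks.
fresh = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
```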