Question

So I have a dataframe that looks like this:

index first_key second_key data text  other
  1      34        987       2   'a'  'name'
  2      34        987       3   'b'  'name' 
  3      40        340       2   'c'  'dog'
  4      34        123       23  'd'  'name'

It was pulled from a database using a JOIN. I want an array that looks like this:

[
   {
     first_key: 34,
     other: 'name',
     second_key: [
        {
          second_key: 987,
          data: [2, 3],
          text: ['a', 'b']
        },
        {
          second_key: 123,
          data: [23],
          text: ['d']
        }
     ]
   },
   {
     first_key: 40,
     other: 'dog',
     second_key: [
        {
          second_key: 340,
          data: [2],
          text: ['c']
        }
     ]
   }
]

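Right now I just loop over every row and build the output piece by piece, but it is really slow. Collapsing those rows first would be much faster.

I tried using groupby and groups, as well as numpy, but I couldn't get it to work. Performance is critical here.

Thanks!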

Answer 1

This gets very close:

import json

new_index = ['first_key', 'other', 'second_key']
# Group on all three keys and turn each group into a dict of its remaining columns.
df2 = df.set_index(new_index).groupby(new_index).apply(dict)
foo = json.loads(df2.to_json(orient='table'))
print(json.dumps(foo, indent=2, sort_keys=True))
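
To get the exact nested shape from the question, one option is to iterate over groups instead of rows: group once on the outer key, then group each chunk on second_key. This is only a sketch. The helper name nest is made up for illustration, the sample frame is retyped from the question, and I haven't benchmarked it against the row loop, so treat any speedup as something to measure on your own data.

```python
import json

import pandas as pd

# Sample frame retyped from the question (the "index" column is left out).
df = pd.DataFrame({
    "first_key": [34, 34, 40, 34],
    "second_key": [987, 987, 340, 123],
    "data": [2, 3, 2, 23],
    "text": ["a", "b", "c", "d"],
    "other": ["name", "name", "dog", "name"],
})


def nest(frame):
    """Hypothetical helper: build the nested list with two levels of groupby."""
    result = []
    # Outer pass: one chunk per (first_key, other) pair.
    # sort=False keeps the groups in order of first appearance.
    for (first_key, other), outer in frame.groupby(["first_key", "other"], sort=False):
        # Inner pass: collapse the rows of each second_key into plain lists.
        children = [
            {
                "second_key": int(second_key),
                "data": inner["data"].tolist(),
                "text": inner["text"].tolist(),
            }
            for second_key, inner in outer.groupby("second_key", sort=False)
        ]
        result.append({
            "first_key": int(first_key),  # numpy int -> Python int for json
            "other": other,
            "second_key": children,
        })
    return result


result = nest(df)
print(json.dumps(result, indent=2))
```

The int() and tolist() calls are there so the result holds plain Python values, which json.dumps can serialize without a custom encoder.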

Optimizing double groupby performance
