Question

I am using pandas groupby and apply to go from a DataFrame of 150 million rows that contains rows like these:

Id  Created     Item    Stock   Price
1   2019-01-01  Item 1  200     10
1   2019-01-01  Item 2  100     15
2   2019-01-01  Item 1  200     10

to a list of 2.2 million records that look like this:

[{
  "Id": 1,
  "Created": "2019-01-01",
  "Items": [
    {"Item":"Item 1", "Stock": 200, "Price": 10},
    {"Item":"Item 2", "Stock": 100, "Price": 15}
    ]
},
{
  "Id": 2,
  "Created": "2019-01-01",
  "Items": [
    {"Item":"Item 1", "Stock": 200, "Price": 10}
    ]
}]

Mainly using the following line of code:

df.groupby(['Id', 'Created']).apply(lambda x: x[['Item', 'Stock', 'Price']].to_dict(orient='records'))

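This takes a lot of time, and as I understand it, operations like this are heavy for pandas. Is there a way in pandas to do the same thing with better performance?

Edit: The operation takes 55 minutes. I am using ScriptProcessor in AWS, which lets me specify how much compute power I want.

Edit 2: Using artona's solution I am getting closer. This is what I have managed to produce so far: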

defaultdict(<function __main__.<lambda>()>,
            {'1': defaultdict(list,
                         {'Id': '1',
                          'Created':'2019-01-01',
                          'Items': [{'Item': Item2, 'Stock': 100, 'Price': 15},
                                    {'Item': Item1, 'Stock': 200, 'Price': 10}]
                         })
            },
           {'2': defaultdict(list,
                         {'Id': '2',
                          'Created':'2019-01-01',
                          'Items': [{'Item': Item1, 'Stock': 200, 'Price': 10}]
                         })
            },

But how do I get from the above to this?

[{
  "Id": 1,
  "Created": "2019-01-01",
  "Items": [
    {"Item":"Item 1", "Stock": 200, "Price": 10},
    {"Item":"Item 2", "Stock": 100, "Price": 15}
    ]
},
{
  "Id": 2,
  "Created": "2019-01-01",
  "Items": [
    {"Item":"Item 1", "Stock": 200, "Price": 10}
    ]
}]

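Basically, for every record I am only interested in the part that comes after "defaultdict(list,". I need those parts in a list that does not depend on the Id as a key.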
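
A minimal sketch of that last step, assuming the intermediate structure is a dict keyed by Id whose values are the per-Id records. The name `by_id` and the way it is filled here are made up for illustration. Taking only the values drops the Id keys, and `dict(...)` turns each inner defaultdict into a plain dict:

```python
from collections import defaultdict

# Hypothetical stand-in for the Id-keyed structure shown above
by_id = defaultdict(lambda: defaultdict(list))
by_id['1']['Id'] = '1'
by_id['1']['Created'] = '2019-01-01'
by_id['1']['Items'].append({'Item': 'Item 2', 'Stock': 100, 'Price': 15})
by_id['1']['Items'].append({'Item': 'Item 1', 'Stock': 200, 'Price': 10})
by_id['2']['Id'] = '2'
by_id['2']['Created'] = '2019-01-01'
by_id['2']['Items'].append({'Item': 'Item 1', 'Stock': 200, 'Price': 10})

# Keep only the values (the records) and convert each one to a plain dict
records = [dict(record) for record in by_id.values()]
```

Note that keying by Id alone would merge rows that share an Id but differ in Created, so this only matches the desired output if each Id has a single Created value.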

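Edit 3: Final update with the results from my production dataset. With the accepted answer provided by artona, I managed to go from 55 minutes down to 7 (!) minutes, without any major changes to my code. The solution provided by Phung Duy Phong took me from 55 minutes to 17 minutes.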

Answer 1

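If the DataFrame is neatly sorted, meaning that all rows for the same (Id, Created) pair are contiguous, you can simply iterate over it. But iterating over a DataFrame is very expensive, because pandas has to build a new Series for each row, so I would iterate directly over the underlying NumPy arrays.

The code could be: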

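records = []
Id = created = None

# Pull the underlying NumPy arrays out once, not on every iteration
ids = df['Id'].values
createds = df['Created'].values
cols = {x: df[x].values for x in ['Item', 'Stock', 'Price']}

for i in range(len(df)):
    # A change in (Id, Created) starts a new record
    if ids[i] != Id or createds[i] != created:
        items = []
        Id = ids[i]
        created = createds[i]
        records.append({'Id': Id, 'Created': created,
                'Items': items})

    items.append({x: arr[i] for x, arr in cols.items()})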

If the data is not sorted initially, you can sort the DataFrame with pandas first and then use the code above.
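
The sort step could look like the following sketch. The tiny `df` is built from the question's sample rows (deliberately out of order), and the contiguity check at the end is an optional addition for illustration, not something the loop itself requires:

```python
import pandas as pd

# Sample rows from the question, deliberately unsorted
df = pd.DataFrame({
    'Id': [2, 1, 1],
    'Created': ['2019-01-01', '2019-01-01', '2019-01-01'],
    'Item': ['Item 1', 'Item 1', 'Item 2'],
    'Stock': [200, 200, 100],
    'Price': [10, 10, 15],
})

# Sort so that all rows of the same (Id, Created) pair are contiguous
df_sorted = df.sort_values(['Id', 'Created']).reset_index(drop=True)

# True wherever a row starts a new (Id, Created) run
keys = df_sorted[['Id', 'Created']]
starts = (keys != keys.shift()).any(axis=1)

# If every pair is contiguous, the number of runs equals the number of distinct pairs
is_contiguous = int(starts.sum()) == len(keys.drop_duplicates())
```

On 150 million rows the sort has its own cost, so it is worth timing the sort plus the loop together against the original groupby.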

Answer 2

Use collections.defaultdict and itertuples. It iterates over the rows only once.

In [105]: %timeit df.groupby(['Id', 'Created']).apply(lambda x: x[['Item', 'Stock', 'Price']].to_dict(orient='records'))
10.1 s ± 44.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [107]:from collections import defaultdict
     ...:def create_dict():
     ...:     dict_ids = defaultdict(lambda : defaultdict(list))
     ...:     for row in df.itertuples():
     ...:          dict_ids[row.Id][row.Created].append({"Item": row.Item, "Stock": row.Stock, "Price": row.Price})
     ...:     list_of_dicts = [{"Id":key_id, "Created":key_created, "Items": values} for key_id, value_id in dict_ids.items() for key_created, values in value_id.items()]
     ...:     return list_of_dicts

In [108]: %timeit create_dict()
4.58 s ± 417 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Answer 3

Try the following:

df['Items'] = df.loc[:, ['Item', 'Stock', 'Price']].to_dict(orient='records')
df.groupby(['Id', 'Created'])['Items'].apply(list).reset_index().to_dict(orient='records')
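
As a sketch, here is the same two-step idea run on the question's three sample rows, using the question's column names. The small `df` is constructed here only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 1, 2],
    'Created': ['2019-01-01'] * 3,
    'Item': ['Item 1', 'Item 2', 'Item 1'],
    'Stock': [200, 100, 200],
    'Price': [10, 15, 10],
})

# Step 1: build one dict per row with a single call on the whole frame
df['Items'] = df[['Item', 'Stock', 'Price']].to_dict(orient='records')

# Step 2: collect the per-row dicts into one list per (Id, Created) pair
result = (
    df.groupby(['Id', 'Created'])['Items']
      .apply(list)
      .reset_index()
      .to_dict(orient='records')
)
```

The point of the split is that the per-group work shrinks to `list`, which is much cheaper than calling `to_dict` on a sub-frame for every group.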

Replace pandas groupby and apply to improve performance
