Question

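I have a very large dataset, and I am looking for an efficient way to get the most frequent value of a feature together with that value's frequency. I found groupby followed by describe to be an elegant way to get these statistics, but it takes a long time on a large DataFrame. I would like to know whether there is a more efficient way.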

Here is a toy example that uses a DataFrame of randomly generated car rental records.

import pandas as pd
import random

colors   = ['white','red','silver','black','blue','yellow']
vendors  = ['Toyota','Wolkswagen','FCA','Renault']
car_types = ['suv','van','citycar']
drivers   = ['Meredith','Sau','Chuck']

random.seed(17)  
records = [ { 'record':i,
              'driver':random.choice(drivers),
              'vendor':random.choice(vendors),
              'type'  :random.choice(car_types),
              'color' :random.choice(colors)} for i in range(0,100)]

df = pd.DataFrame(records)
gb = df.groupby('driver')
print(gb['type'].describe())

Output:

         count unique      top freq
driver                             
Chuck       30      3      van   14
Meredith    32      3      van   13
Sau         38      3  citycar   15

In my application I have a dataset with several hundred thousand items, and the nice line gb['type'].describe() obviously takes a very long time.

I think this is different from the similar question

what is the most efficient way of counting occurrences in pandas?

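in the sense that I am not just looking for the most common value of a feature, but for the most common value of that feature for each key. In the example above, I want to know which car type each driver rents most often. describe returns nicely formatted output, but also more information than I need. I tried using gb['type'].value_counts():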

driver    type   
Chuck     van        14
          suv        11
          citycar     5
Meredith  van        13
          citycar    11
          suv         8
Sau       citycar    15
          van        14
          suv         9
Name: type, dtype: int64

but it looks like I cannot get from the multi-indexed Series to the DataFrame I want, such as

          type       freq
driver
Chuck     van        14
Meredith  van        13
Sau       citycar    15
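
For reference, one possible way to build a frame of this shape is sketched below. It counts each (driver, type) pair with groupby(...).size(), sorts the counts, and keeps the first row per driver. The helper name top_per_group and the small hand-made sample are hypothetical and only for illustration. The sketch has not been benchmarked against describe, and when two values tie for the top count it keeps only one of them.

```python
import pandas as pd

# Small hand-made sample (hypothetical data, not the random records above).
df = pd.DataFrame({
    'driver': ['Chuck', 'Chuck', 'Chuck', 'Sau', 'Sau', 'Sau', 'Sau'],
    'type':   ['van', 'van', 'suv', 'citycar', 'citycar', 'citycar', 'van'],
})

def top_per_group(frame, key, feature):
    """Return the most frequent `feature` value and its count for each `key`."""
    # Count every (key, feature) pair and flatten the result into columns.
    counts = frame.groupby([key, feature]).size().reset_index(name='freq')
    # Within each key, put the largest count first.
    counts = counts.sort_values([key, 'freq'], ascending=[True, False])
    # Keep only the first (most frequent) row per key; ties keep a single row.
    top = counts.drop_duplicates(subset=key, keep='first')
    return top.set_index(key)

result = top_per_group(df, 'driver', 'type')
print(result)
```

With the toy example above, the call would be top_per_group(df, 'driver', 'type') on the rental records DataFrame.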

Pandas: efficient way to get the top value and its count
