Question

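I'm new to Python and am using it to get into data analysis. I'm practicing on a dataset in which one column has 87 distinct values and another has 888 distinct values, so I'm inclined to drop the latter. I just don't understand how to handle columns like these. Should I group their values or drop the columns? And if I group, how should I go about it? Your thoughts would be much appreciated. @Toby Petty @Vaishali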

For example:

"Departure" : { "Product" : { "name" : "Länstrafik - Buss 201", "num" : "201", "catCode" : "7", "catOutS" : "BLT", "catOutL" : "Länstrafik - Buss", "operatorCode" : "254", "operator" : "JLT", "operatorUrl" : "http://www.jlt.se" }, "Stops" : { "name" : "Gislaved Lundåkerskolan", "id" : "740040260", "extId" : "740040260", "routeIdx" : 12, "lon" : 13.530096, "lat" : 57.298178, "depTime" : "20:55:00", "depDate" : "2019-03-05" } }

import pandas as pd
import numpy as np

print("Count of distinct entries for car:", len(set(car_sales['car'])))
print("Distinct entries for car:", set(car_sales['car']))

Answer 1

What exactly is your question?

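Update: after some clarification and guesswork, I will assume the question comes down to two things:

How to restrict a groupby to only the top k groups (by some selection criterion).
How to aggregate columns, including some non-numeric ones.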

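For starters, seaborn (imported as sns) ships some nice datasets that are handy for this kind of question. Here we use "mpg", which contains information on cars and their mileage.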

import pandas as pd
import numpy as np
import seaborn as sns

df = sns.load_dataset('mpg')

We split the provided name column into brand and model:

df[['brand', 'model']] = pd.DataFrame(df.name.str.split(' ', n=1).values.tolist())
df.head(3)

Out[]:
    mpg  cylinders  displacement  horsepower  weight  acceleration  \
0  18.0          8         307.0       130.0    3504          12.0   
1  15.0          8         350.0       165.0    3693          11.5   
2  18.0          8         318.0       150.0    3436          11.0   

   model_year origin                       name      brand            model  
0          70    usa  chevrolet chevelle malibu  chevrolet  chevelle malibu  
1          70    usa          buick skylark 320      buick      skylark 320  
2          70    usa         plymouth satellite   plymouth        satellite
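
As a side note, the same split can probably be written a little more directly with expand=True, which makes str.split return one column per piece. The snippet is only a sketch on a tiny made-up frame (cars and its rows are assumptions, not the real mpg data); it also shows that a name without a space ends up with a missing model.

```python
import pandas as pd

# Tiny made-up stand-in for the 'name' column of the mpg dataset.
cars = pd.DataFrame({'name': ['chevrolet chevelle malibu',
                              'buick skylark 320',
                              'subaru']})

# n=1 splits on the first space only; expand=True returns one column per piece.
cars[['brand', 'model']] = cars['name'].str.split(' ', n=1, expand=True)

# A name with no space has no second piece, so its model is missing.
print(cars)
```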

We also add a column n, which we will use later to count how many entries fall into each group of the summary:

df['n'] = 1

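Find the top 5 groups by maximum acceleration. (The OP wants to rank by total sales, so in that case one would use sales.sum() instead of acceleration.max(), but this dataset has no sales figures.) The key point is to build an index of the groups we want to report, so that every other group can be renamed to "Other". We convert this index, called idx, to a list of tuples to make subsetting easier.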

idx = df.groupby(['brand', 'model']).acceleration.max().sort_values(ascending=False).head(5).index.to_list()
idx

Out[]:
[('peugeot', '504'),
 ('vw', 'pickup'),
 ('vw', 'dasher (diesel)'),
 ('volkswagen', 'type 3'),
 ('chevrolet', 'chevette')]
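
For the OP's case, ranking by total sales might look roughly like this. It is a sketch under assumptions: the frame below is invented, and the column names model and sales are guesses, since the question only shows car_sales['car'].

```python
import pandas as pd

# Invented stand-in for the asker's data; 'model' and 'sales' are assumed column names.
car_sales = pd.DataFrame({
    'car':   ['Ford', 'Ford', 'BMW', 'BMW', 'Audi', 'Kia'],
    'model': ['Focus', 'Focus', 'X5', '320', 'A4', 'Rio'],
    'sales': [10, 15, 30, 5, 8, 2],
})

k = 2  # how many groups to keep

# Total sales per (car, model) group, largest first, reduced to the top k keys.
totals = car_sales.groupby(['car', 'model'])['sales'].sum()
idx = totals.sort_values(ascending=False).head(k).index.to_list()
print(idx)
```

The resulting list of tuples can then be used exactly like idx in the mpg example.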

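Now build a boolean selector top10 that is True for rows belonging to the selected groups (despite its name, it marks the top 5 groups here).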

top10 = df.set_index(['brand', 'model']).index.isin(idx)

Rename everything else:

df.loc[~top10, 'brand'] = 'Other'
df.loc[~top10, 'model'] = ''
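
If the goal is simply to tame a single column with hundreds of distinct values, the same idea can be applied by frequency instead of by an aggregate. A minimal sketch, assuming that "top" means "most frequent" and using an invented Series named models:

```python
import pandas as pd

# Invented high-cardinality column, shortened for illustration.
models = pd.Series(['a4', 'x5', 'a4', 'rio', 'a4', 'x5', 'focus'], name='model')

k = 2  # number of labels to keep

# The k most frequent labels; everything else is replaced by 'Other'.
top = models.value_counts().head(k).index
collapsed = models.where(models.isin(top), 'Other')
print(collapsed.value_counts())
```

Whether to keep the top k labels or all labels above some count threshold is a judgment call that depends on the data.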

Now, for non-numeric columns, we choose to report the majority value (the value that occurs most often in the group).

from collections import Counter
def majority(*args):
    return Counter(*args).most_common(1)[0][0]

# example
majority('z a b a a c d'.split())

Out[]:
'a'
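
A pandas-only variant is possible with Series.mode(). This is a sketch rather than a drop-in replacement: the name majority_mode is made up here, and ties are resolved differently, because mode() returns its results sorted, so the smallest of the tied values wins.

```python
import pandas as pd

def majority_mode(values):
    """Return the most frequent value; on ties, the smallest one (mode() sorts its result)."""
    return pd.Series(values).mode().iloc[0]

# example
print(majority_mode('z a b a a c d'.split()))
```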

Finally, we define a dictionary that maps each column to the aggregator to use for it:

# numeric: use mean
desired = {k:'mean' for k in df.columns if np.issubdtype(df[k].dtype, np.number)}
# simplified:
desired = {k:'mean' for k in ['mpg', 'horsepower', 'weight']}

# non-numeric: use majority    
desired.update({'origin': majority})

# also report the size of each group
desired.update({'n': 'sum'})
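
The numeric columns could also be picked with select_dtypes instead of testing each dtype by hand. A small sketch on an invented frame (df_small is an assumption, not the mpg data); the counter column n is excluded from the mean and summed instead:

```python
import pandas as pd

# Invented frame with numeric and non-numeric columns plus the counter column n.
df_small = pd.DataFrame({
    'mpg':    [18.0, 44.0],
    'weight': [3504, 2130],
    'origin': ['usa', 'europe'],
    'n':      [1, 1],
})

# All numeric columns except the counter get 'mean'; the counter gets 'sum'.
numeric_cols = df_small.select_dtypes(include='number').columns.drop('n')
desired = {col: 'mean' for col in numeric_cols}
desired['n'] = 'sum'
print(desired)
```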

Now do the grouping and aggregation:

df.groupby(['brand', 'model']).agg(desired)

Out[]:
                                  mpg  horsepower       weight  origin    n
brand      model                                                           
Other                       23.340052  105.540682  2984.651163     usa  387
chevrolet  chevette         30.400000   63.250000  2090.250000     usa    4
peugeot    504              23.550000   83.500000  3022.250000  europe    4
volkswagen type 3           23.000000   54.000000  2254.000000  europe    1
vw         dasher (diesel)  43.400000   48.000000  2335.000000  europe    1
           pickup           44.000000   52.000000  2130.000000  europe    1
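
An alternative that avoids both the helper column n and the aggregator dictionary is pandas' named aggregation, where each output column is given as a (column, function) pair. The sketch below uses an invented frame and made-up output names (mpg_mean, origin_majority); the 'size' aggregator counts the rows per group.

```python
import pandas as pd

# Invented frame in which the rare brands have already been renamed to 'Other'.
df_small = pd.DataFrame({
    'brand':  ['vw', 'vw', 'Other', 'Other', 'Other'],
    'mpg':    [44.0, 43.0, 18.0, 15.0, 21.0],
    'origin': ['europe', 'europe', 'usa', 'usa', 'japan'],
})

summary = df_small.groupby('brand').agg(
    mpg_mean=('mpg', 'mean'),
    # most frequent origin within the group
    origin_majority=('origin', lambda s: s.value_counts().idxmax()),
    # number of rows in the group
    n=('mpg', 'size'),
)
print(summary)
```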

Handling a large number of distinct values in a DataFrame column
