Question

I am trying to find the count of distinct values in each column using Pandas. This is what I did.

import pandas as pd
import numpy as np

# Generate data.
NROW = 10000
NCOL = 100
df = pd.DataFrame(np.random.randint(1, 100000, (NROW, NCOL)),
                  columns=['col' + x for x in np.arange(NCOL).astype(str)])

I need to count the number of distinct elements in each column, like this:

col0    9538
col1    9505
col2    9524

What would be the most efficient way to do this, given that the method will be applied to files larger than 1.5GB?

Based on the answers, df.apply(lambda x: len(x.unique())) is the fastest (notebook).

%timeit df.apply(lambda x: len(x.unique()))
10 loops, best of 3: 49.5 ms per loop

%timeit df.nunique()
10 loops, best of 3: 59.7 ms per loop

%timeit df.apply(pd.Series.nunique)
10 loops, best of 3: 60.3 ms per loop

%timeit df.T.apply(lambda x: x.nunique(), axis=1)
10 loops, best of 3: 60.5 ms per loop
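For anyone who wants to reproduce this comparison outside a notebook, here is a minimal sketch using the standard-library timeit module. The smaller frame size and the repetition counts are assumptions chosen to keep the run short; absolute numbers will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the frame in the question so the benchmark finishes quickly.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 100000, (1000, 20)),
                  columns=['col' + str(i) for i in range(20)])

candidates = {
    'apply + len(unique)': lambda: df.apply(lambda x: len(x.unique())),
    'DataFrame.nunique': lambda: df.nunique(),
    'apply + Series.nunique': lambda: df.apply(pd.Series.nunique),
}

# Best of 3 repeats, each repeat timing 10 calls of the candidate.
timings = {name: min(timeit.repeat(fn, number=10, repeat=3))
           for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f'{name}: {seconds / 10 * 1000:.2f} ms per call')
```

All three candidates return the same counts on data without missing values, so only the timing differs.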

Answer 1

As of pandas 0.20, we can call nunique directly on a DataFrame, i.e.:

df.nunique()
a    4
b    5
c    1
dtype: int64

Other legacy options:

You could transpose the df and then use apply to call nunique row-wise:

In [205]:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df

Out[205]:
   a  b  c
0  0  1  1
1  1  2  1
2  1  3  1
3  2  4  1
4  3  5  1

In [206]:
df.T.apply(lambda x: x.nunique(), axis=1)

Out[206]:
a    4
b    5
c    1
dtype: int64

Edit

As pointed out by @ajcr, the transpose is unnecessary:

In [208]:
df.apply(pd.Series.nunique)

Out[208]:
a    4
b    5
c    1
dtype: int64
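One detail worth knowing when comparing these options: nunique ignores NaN by default, while len(x.unique()) counts NaN as a value, so the two can disagree on columns with missing data. A small sketch on a made-up frame (the name df_nan is only for illustration):

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({'a': [0, 1, 1, np.nan], 'b': [1, 2, 3, 4]})

default_counts = df_nan.nunique()                      # NaN excluded (dropna=True)
with_nan_counts = df_nan.nunique(dropna=False)         # NaN counted as a value
unique_len = df_nan.apply(lambda x: len(x.unique()))   # unique() keeps NaN too

print(default_counts)
print(with_nan_counts)
print(unique_len)
```

If missing values should count as their own category, pass dropna=False so the result matches the len(x.unique()) variant.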

Answer 2

A pandas Series has a .value_counts() method that provides what you want. Check out the documentation for the function.
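As a minimal sketch of how value_counts relates to the question: it returns one entry per distinct value, so its length is the number of distinct values in that column. The small frame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': [1, 1, 1, 1, 1]})

# Frequency of each distinct value in a single column.
counts_a = df['a'].value_counts()
print(counts_a)

# One entry per distinct value, so the length is the distinct count.
distinct_per_column = df.apply(lambda s: len(s.value_counts()))
print(distinct_per_column)
```

This does more work than nunique because it also tallies frequencies, so it is mainly useful when the per-value counts are needed as well.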

Answer 3

There are already some great answers here :) but this one seems to be missing:

df.apply(lambda x: x.nunique())

As of pandas 0.20.0, DataFrame.nunique() is also available.

Answer 4

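Recently I ran into the same problem of counting the unique values of each column in a DataFrame, and I found an approach that runs faster than the apply function: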

# Choose how to store the output (a pd.DataFrame or a dict); a dict is used here:
col_uni_val = {}
for i in df.columns:
    col_uni_val[i] = len(df[i].unique())

# Use pprint to display the dict nicely:
import pprint
pprint.pprint(col_uni_val)

For me, this is almost twice as fast as df.apply(lambda x: len(x.unique())).
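If the file is too large to load comfortably in one go, the same per-column loop can be applied chunk by chunk, keeping a set of seen values per column. This is a sketch under assumptions, not something taken from the answers here: the CSV layout, the chunk size, and the in-memory buffer standing in for a real file are all illustrative.

```python
import io

import pandas as pd

# Stand-in for a large CSV file on disk (assumption for illustration).
csv_data = io.StringIO("col0,col1\n1,a\n2,a\n2,b\n3,a\n1,c\n")

seen = {}  # column name -> set of distinct values seen so far
for chunk in pd.read_csv(csv_data, chunksize=2):
    for col in chunk.columns:
        # dropna() mirrors the default behaviour of nunique().
        seen.setdefault(col, set()).update(chunk[col].dropna().unique())

col_uni_val = {col: len(values) for col, values in seen.items()}
print(col_uni_val)
```

The sets grow with the number of distinct values, so this saves memory only when columns contain many repeats relative to the number of rows.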

Answer 5


Answer 6

To pick out, from all the columns of a pandas DataFrame, only those with more than 20 unique values:

col_with_morethan_20_unique_values_cat=[]
for col in data.columns:
    # Only look at object (string) columns.
    if data[col].dtype =='O':
        if len(data[col].unique()) >20:
            col_with_morethan_20_unique_values_cat.append(data[col].name)
        else:
            continue

print(col_with_morethan_20_unique_values_cat)
print('total number of columns with more than 20 number of unique value is',len(col_with_morethan_20_unique_values_cat))



 # The o/p will be as:
['CONTRACT NO', 'X2','X3',,,,,,,..]
total number of columns with more than 20 number of unique value is 25
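The same filter can also be written without an explicit loop. A sketch, assuming a frame named data and the same threshold of 20; select_dtypes and nunique are standard pandas calls, while the sample columns and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented sample data: one high-cardinality and one low-cardinality text column.
data = pd.DataFrame({
    'CONTRACT NO': pd.Series(['C' + str(i) for i in range(30)], dtype=object),
    'X1': pd.Series(['yes', 'no'] * 15, dtype=object),
    'amount': np.arange(30),
})

# Distinct counts for object (string) columns only.
object_counts = data.select_dtypes(include='object').nunique()

# Keep the names of columns with more than 20 distinct values.
high_cardinality_cols = object_counts[object_counts > 20].index.tolist()
print(high_cardinality_cols)
print('number of such columns:', len(high_cardinality_cols))
```

Unlike len(data[col].unique()), nunique skips NaN by default, so the two versions can differ by one on columns with missing values.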

Answer 7

Adding example code for the answer given by @CaMaDuPe85
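A minimal sketch of that approach, df.apply(lambda x: x.nunique()), on a small made-up frame (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': [1, 1, 1, 1, 1]})

# Count distinct values column by column.
distinct_counts = df.apply(lambda x: x.nunique())
print(distinct_counts)
```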


Answer 8

I found that:

df.agg(['nunique']).T

is much faster.
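Note that this form returns a DataFrame rather than a Series: one row per original column, with a single column named nunique. A short sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': [1, 1, 1, 1, 1]})

result = df.agg(['nunique']).T
print(result)

# Select the single column to get a Series shaped like the df.nunique() result.
as_series = result['nunique']
print(as_series)
```

Whether it is actually faster likely depends on the data and the pandas version, so it is worth timing on your own frame.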

Finding the count of distinct elements in each column of a DataFrame

8 answers: