Question

Suppose I have a pandas df like this:

Index   A     B
0      foo    3
1      foo    2
2      foo    5
3      bar    3
4      bar    4
5      baz    5

What is a good way to add a column like this?

Index   A     B    Aidx
0      foo    3    0
1      foo    2    0
2      foo    5    0
3      bar    3    1
4      bar    4    1
5      baz    5    2

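That is, an index that increases with each new unique value?

I know I could use df.A.unique(), then build a lookup with a dictionary and enumerate, and then apply that dictionary lookup to create the column. But I feel there should be a quicker way, possibly involving groupby with some special function?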
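For reference, the baseline described in the question might look roughly like the sketch below. The DataFrame is rebuilt from the example table; the name `lookup` is only illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar", "bar", "baz"],
                   "B": [3, 2, 5, 3, 4, 5]})

# Series.unique() returns values in order of first appearance,
# so enumerate() numbers them 0, 1, 2, ... in that order.
lookup = {value: idx for idx, value in enumerate(df["A"].unique())}

# Map every value in column A to its number.
df["Aidx"] = df["A"].map(lookup)
print(df)
```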

Answer 1

One way is to use ngroup. Keep in mind that you have to make sure groupby does not re-sort the groups if you want the desired output, so set sort=False:

df['Aidx'] = df.groupby('A',sort=False).ngroup()
>>> df
   Index    A  B  Aidx
0      0  foo  3     0
1      1  foo  2     0
2      2  foo  5     0
3      3  bar  3     1
4      4  bar  4     1
5      5  baz  5     2
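To see why sort=False matters here, a small sketch comparing both settings on the example data (the variable names are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar", "bar", "baz"],
                   "B": [3, 2, 5, 3, 4, 5]})

# sort=False: groups are numbered in order of first appearance.
by_appearance = df.groupby("A", sort=False).ngroup()

# sort=True (the default): groups are numbered by sorted key,
# so "bar" becomes 0, "baz" 1 and "foo" 2.
by_sorted_key = df.groupby("A", sort=True).ngroup()

print(by_appearance.tolist())
print(by_sorted_key.tolist())
```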

Answer 2

No groupby needed; you can use any of the following.

Method 1: factorize

pd.factorize(df.A)[0]
array([0, 0, 0, 1, 1, 2], dtype=int64)
#df['Aidx']=pd.factorize(df.A)[0]
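A slightly fuller sketch of factorize, in case it helps: it returns both the codes and the unique values, and by default missing values get the code -1. The Series `s` below is made-up sample data, not the OP's.

```python
import pandas as pd

# Made-up sample with one missing value.
s = pd.Series(["foo", "foo", "bar", None, "baz"])

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(s)

print(codes)          # missing values are coded as -1 by default
print(list(uniques))
```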

Method 2: sklearn

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.A)
LabelEncoder()
le.transform(df.A)
array([2, 2, 2, 0, 0, 1])

Method 3: cat.codes

df.A.astype('category').cat.codes
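Note that, like the LabelEncoder output above, cat.codes numbers the categories in sorted order rather than in order of appearance. If the order from the question is needed, one possible workaround (a sketch, not part of the original answer) is to pass the categories explicitly:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar", "bar", "baz"],
                   "B": [3, 2, 5, 3, 4, 5]})

# Default: categories are sorted, so bar=0, baz=1, foo=2.
sorted_codes = df["A"].astype("category").cat.codes

# Explicit categories in order of first appearance: foo=0, bar=1, baz=2.
ordered_codes = pd.Categorical(df["A"], categories=df["A"].unique()).codes

print(sorted_codes.tolist())
print(ordered_codes.tolist())
```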

Method 4: map + unique

l=df.A.unique()
df.A.map(dict(zip(l,range(len(l)))))
0    0
1    0
2    0
3    1
4    1
5    2
Name: A, dtype: int64

Method 5: np.unique

x,y=np.unique(df.A.values,return_inverse=True)
y
array([2, 2, 2, 0, 0, 1], dtype=int64)
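np.unique sorts the unique values, which is why the codes above come out as 2, 2, 2, 0, 0, 1. If first-appearance order is wanted, the codes can be remapped using return_index. This is only a sketch of one way to do it; the helper names `order` and `rank` are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar", "bar", "baz"]})
values = df["A"].to_numpy()

# uniques is sorted; first_idx[i] is where uniques[i] first occurs in values.
uniques, first_idx, inverse = np.unique(
    values, return_index=True, return_inverse=True)

# order lists the sorted-unique positions by first appearance;
# rank inverts that permutation, giving each sorted position its new code.
order = np.argsort(first_idx)
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

codes = rank[inverse]
print(codes)
```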

Edit: some timings on the OP's dataframe


%timeit pd.factorize(view.Company)[0]

The slowest run took 6.68 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 155 µs per loop

%timeit view.Company.astype('category').cat.codes

The slowest run took 4.48 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 449 µs per loop

from itertools import izip

%timeit l = view.Company.unique(); view.Company.map(dict(izip(l,xrange(len(l)))))

1000 loops, best of 3: 666 µs per loop

import numpy as np

%timeit np.unique(view.Company.values, return_inverse=True)

The slowest run took 8.08 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.7 µs per loop

It looks like numpy wins.
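The timings above were taken on Python 2 (izip and xrange correspond to zip and range on Python 3). To rerun the comparison outside IPython, something like the following sketch could be used. The `company` Series is synthetic stand-in data, not the OP's `view.Company`, so the numbers will differ; also note that the np.unique call only returns sorted codes and does not build a Series, so the comparison is not strictly like for like.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for the OP's view.Company column.
company = pd.Series(rng.choice(["foo", "bar", "baz", "qux"], size=10_000))

candidates = {
    "factorize": lambda: pd.factorize(company)[0],
    "cat.codes": lambda: company.astype("category").cat.codes,
    "map + unique": lambda: company.map(
        {v: i for i, v in enumerate(company.unique())}),
    "np.unique": lambda: np.unique(company.to_numpy(), return_inverse=True)[1],
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, 20 calls each, reported per call.
    timings[name] = min(timeit.repeat(func, number=20, repeat=3)) / 20
    print(f"{name:>12}: {timings[name] * 1e6:8.1f} µs per call")
```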

Answer 3

Here is another way.

df['C'] = df.A.ne(df.A.shift()).cumsum()-1
df

When we print df, it looks like this.

  Index  A    B  C
0  0     foo  3  0
1  1     foo  2  0 
2  2     foo  5  0 
3  3     bar  3  1 
4  4     bar  4  1 
5  5     baz  5  2

Explanation of the solution: to make it easier to follow, let's break the solution above into parts.

Step 1: Compare column A of df with itself shifted down by one row, as follows.

df.A.ne(df.A.shift())

The output we get is:

0     True
1    False
2    False
3     True
4    False
5     True

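Step 2: Apply cumsum(). Each True value (which appears wherever a value in column A differs from the value in the row above it) adds one to the running total, so the total goes up by one at the start of every new run of values. Subtracting 1 makes the index start at 0.

df.A.ne(df.A.shift()).cumsum()-1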
0    0
1    0
2    0
3    1
4    1
5    2
Name: A, dtype: int32

Step 3: Save the result into df['C'], which creates a new column named C in df.
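One caveat worth illustrating: because the comparison is against the previous row, this approach numbers runs of consecutive equal values, not unique values. It matches the desired output in the question only because equal values in A are adjacent. A small sketch on made-up data where "foo" reappears later:

```python
import pandas as pd

# Made-up data: "foo" shows up again after "bar".
s = pd.Series(["foo", "bar", "foo"])

# shift/cumsum starts a new number at every change of value.
run_ids = s.ne(s.shift()).cumsum() - 1

# factorize gives the same number to every occurrence of a value.
value_ids = pd.factorize(s)[0]

print(run_ids.tolist())
print(value_ids.tolist())
```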

Pythonic and uFunc-y way to turn a pandas column into an "increasing" index?
