Question

I have a DataFrame that looks like this:

import numpy as np
import pandas as pd

df=pd.DataFrame({'col1':[1,2,3,4,5,6], 'col2':list('AASOSP')})
df

I have two lists:

lis1=['A']
lis2=['S','O']

I need to replace the values in col2 based on lis1 and lis2, so I tried np.where, like this:

df['col2'] = np.where(df.col2.isin(lis1),'PC',df.col2.isin(lis2),'Ln','others')

But it gives me the following error:

TypeError: function takes at most 3 arguments (5 given)

Any suggestions would be much appreciated!

In the end, my goal is to replace the values in col2 so that the DataFrame looks like this:

   col1    col2
0     1      PC
1     2      PC
2     3      Ln
3     4      Ln
4     5      Ln
5     6  others

Answer 1

Use nested numpy.where calls:

lis1=['A']
lis2=['S','O']

df['col2'] = np.where(df.col2.isin(lis1),'PC',
             np.where(df.col2.isin(lis2),'Ln','others'))

print (df)
   col1    col2
0     1      PC
1     2      PC
2     3      Ln
3     4      Ln
4     5      Ln
5     6  others
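
As a side note that is not part of the original answer: when there are more than two lists, nesting np.where gets hard to read. One possible alternative is np.select, which takes a list of conditions and a matching list of replacement values. The sketch below assumes the same df, lis1 and lis2 as in the question; the names conditions and choices are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6], 'col2': list('AASOSP')})
lis1 = ['A']
lis2 = ['S', 'O']

# Conditions are evaluated in order; for each row the first True one wins.
conditions = [df['col2'].isin(lis1), df['col2'].isin(lis2)]
# One replacement value per condition, in the same order.
choices = ['PC', 'Ln']

# Rows that match none of the conditions get the default value.
df['col2'] = np.select(conditions, choices, default='others')
print(df)
```

Adding a third list would then only need one more entry in conditions and one more in choices.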

Timings:

#[60000 rows x 2 columns]
df = pd.concat([df]*10000).reset_index(drop=True)

In [257]: %timeit np.where(df.col2.isin(lis1),'PC',np.where(df.col2.isin(lis2),'Ln','others'))
100 loops, best of 3: 8.15 ms per loop

# in1d_based is the function defined in Answer 2
In [258]: %timeit in1d_based(df, lis1, lis2)
100 loops, best of 3: 4.98 ms per loop

Answer 2

Here's one approach:

a = df.col2.values
df.col2 = np.take(['others','PC','Ln'], np.in1d(a,lis1) + 2*np.in1d(a,lis2))

Step-by-step sample run:

# Input dataframe
In [206]: df
Out[206]: 
   col1 col2
0     1    A
1     2    A
2     3    S
3     4    O
4     5    S
5     6    P

# Extract out col2 values
In [207]: a = df.col2.values

# Form an indexing array based on where we have matches in lis1 or lis2 or neither
In [208]: idx = np.in1d(a,lis1) + 2*np.in1d(a,lis2)

In [209]: idx
Out[209]: array([1, 1, 2, 2, 2, 0])

# Index into a list of new strings with those indices
In [210]: newvals = np.take(['others','PC','Ln'], idx)

In [211]: newvals
Out[211]: 
array(['PC', 'PC', 'Ln', 'Ln', 'Ln', 'others'], 
      dtype='|S6')

# Finally assign those into col2
In [212]: df.col2 = newvals

In [213]: df
Out[213]: 
   col1    col2
0     1      PC
1     2      PC
2     3      Ln
3     4      Ln
4     5      Ln
5     6  others
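
Two caveats about the indexing trick above, with a sketch that addresses them. First, np.in1d is deprecated as of NumPy 2.0 in favor of np.isin, so the version below uses np.isin. Second, the index arithmetic assumes that lis1 and lis2 do not overlap: a value present in both lists would get index 1 + 2 = 3, which is out of range for a three-element label array. The explicit overlap check and the names in1, in2 and labels are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6], 'col2': list('AASOSP')})
lis1 = ['A']
lis2 = ['S', 'O']

a = df['col2'].to_numpy()
in1 = np.isin(a, lis1)  # True where the value is in lis1
in2 = np.isin(a, lis2)  # True where the value is in lis2

# Guard (an assumption made explicit): no value may be in both lists,
# otherwise the index below would be 3 and fall outside `labels`.
if (in1 & in2).any():
    raise ValueError("lis1 and lis2 must not share values")

# 0 -> in neither list, 1 -> in lis1, 2 -> in lis2
idx = in1.astype(int) + 2 * in2.astype(int)
labels = np.array(['others', 'PC', 'Ln'])

df['col2'] = labels[idx]
print(df)
```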

Runtime test:

In [251]: df=pd.DataFrame({'col1':[1,2,3,4,5,6], 'col2':list('AASOSP')})

In [252]: df = pd.concat([df]*10000).reset_index(drop=True)

In [253]: lis1
Out[253]: ['A']

In [254]: lis2
Out[254]: ['S', 'O']

In [255]: def in1d_based(df, lis1, lis2):
     ...:     a = df.col2.values
     ...:     return np.take(['others','PC','Ln'], np.in1d(a,lis1) + 2*np.in1d(a,lis2))
     ...: 

# @jezrael's soln
In [256]: %timeit np.where(df.col2.isin(lis1),'PC', np.where(df.col2.isin(lis2),'Ln','others'))
100 loops, best of 3: 3.78 ms per loop

In [257]: %timeit in1d_based(df, lis1, lis2)
1000 loops, best of 3: 1.89 ms per loop
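
For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard-library timeit module. It first checks that both approaches return the same labels, since a timing comparison only makes sense if the results agree. The function names where_based and take_based and the variable big are hypothetical, and np.isin is used in place of np.in1d; absolute numbers will vary with the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6], 'col2': list('AASOSP')})
big = pd.concat([df] * 10000).reset_index(drop=True)  # 60000 rows
lis1 = ['A']
lis2 = ['S', 'O']


def where_based(frame, lis1, lis2):
    # Nested np.where, as in Answer 1.
    return np.where(frame['col2'].isin(lis1), 'PC',
                    np.where(frame['col2'].isin(lis2), 'Ln', 'others'))


def take_based(frame, lis1, lis2):
    # Index-arithmetic approach from Answer 2, with np.isin instead of np.in1d.
    a = frame['col2'].to_numpy()
    return np.take(['others', 'PC', 'Ln'],
                   np.isin(a, lis1) + 2 * np.isin(a, lis2))


# Sanity check before timing: both functions must produce identical labels.
same = np.array_equal(where_based(big, lis1, lis2), take_based(big, lis1, lis2))
print("results match:", same)

for func in (where_based, take_based):
    # Best of 3 repeats, 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(lambda: func(big, lis1, lis2), number=10, repeat=3))
    print(f"{func.__name__}: {best / 10 * 1e3:.2f} ms per call")
```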
