Pandas: operating on a column based on whether its values fall between two numbers

Time: 2020-04-06 01:59:49

Tags: python python-3.x pandas dataframe

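I'm currently using Pandas and NumPy. I have a DataFrame named "df". Given the data below, how can I assign a value to a third column based on a between clause? If possible, I'd like to do this in a vectorised way to keep the current speed.

I tried lambda functions, but frankly I don't understand what I'm doing, and I ran into errors such as an object having no attribute "between".

The conventional, non-vectorised approach: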

NOTE: I am looking for a way to make this vectorised.

If df['Col2'] is between 0 and 10
   df['Col3'] = 1
Else if df['Col2'] is between 10.01 and 20
   df['Col3'] = 2
Else if df['Col2'] is between 20.1 and 30
   df['Col3'] = 3

Sample data

+------+------+------+
| Col1 | Col2 | Col3 |
+------+------+------+
| a    |    5 |    1 |
| b    |   10 |    1 |
| c    |   15 |    2 |
| d    |   20 |    2 |
| e    |   25 |    3 |
| f    |   30 |    3 |
| g    |    1 |    1 |
| h    |   11 |    2 |
| i    |   21 |    3 |
| j    |    7 |    1 |
+------+------+------+


Many thanks.

3 answers:

Answer 0 (score: 5)

A solution that reuses your current code:

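import numpy as np

def cust_func(row):
    # Map the Col2 value of a single row to its bucket.
    r = row['Col2']
    if 0 <= r <= 10:
        val = 1
    elif 10.01 <= r <= 20:
        val = 2
    elif 20.01 <= r <= 30:
        val = 3
    else:
        # Outside every range (including gaps such as 10.005): no bucket.
        val = np.nan
    return val

# apply with axis=1 calls cust_func once per row, so this is not vectorised.
df['Col3'] = df.apply(cust_func, axis=1)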

Best solution:

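import pandas as pd
cut_labels = [1, 2, 3]
cut_bins = [0, 10, 20, 30]
# Right-closed bins (0, 10], (10, 20], (20, 30]; include_lowest=True also admits 0.
df['Col3'] = pd.cut(df['Col2'], bins=cut_bins, labels=cut_labels, include_lowest=True)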
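
To see which side of each bin edge is closed, here is a minimal sketch that rebuilds the sample frame from the question and shows the interval each value falls into. It assumes the pandas default of right-closed bins; the names `intervals` and `col3` are only illustrative and do not come from the answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Col1": list("abcdefghij"),
    "Col2": [5, 10, 15, 20, 25, 30, 1, 11, 21, 7],
})

cut_bins = [0, 10, 20, 30]

# Without labels, pd.cut returns the interval each value belongs to,
# which makes the closed right edge visible: 10 lands in (0, 10].
intervals = pd.cut(df["Col2"], bins=cut_bins)

# With labels the result is categorical; cast to int if plain integers
# are needed downstream (only safe when no value is out of range).
col3 = pd.cut(df["Col2"], bins=cut_bins, labels=[1, 2, 3]).astype(int)

print(pd.DataFrame({"Col2": df["Col2"], "interval": intervals, "Col3": col3}))
```

Values outside every bin come back as NaN, so the integer cast above would fail on such data; that is worth checking before relying on it.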

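Answer 1 (score: 2)

There are two options: numpy.select and numpy.searchsorted. I prefer the latter because I don't have to list the conditions; it uses a bisection algorithm, which works as long as the array being searched (here, the bin edges) is sorted. And yes, I'd like to think it is the fastest.

It would be cool if you ran some timings and shared the results: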

  Col1  Col2
0   a   5
1   b   10
2   c   15
3   d   20
4   e   25
5   f   30
6   g   1
7   h   11
8   i   21
9   j   7

import numpy as np

# Step 1: define the bin edges and the scores

# Sort the dataframe on Col2 (optional: searchsorted only needs
# `benchmarks` to be sorted, not the values being looked up)
df = df.sort_values('Col2')

# benchmarks are the upper edges of the ranges that determine the score
benchmarks = np.array([10, 20, 30])

# the scores to be assigned based on Col2
score = np.array([1, 2, 3])

# Use searchsorted: it returns the index at which each value would be
# inserted into benchmarks to keep it sorted.
# e.g. for [1, 4, 5] the position of 3 is 1, since it lies between 1 and 4
# and Python indexing is zero-based.
indices = np.searchsorted(benchmarks, df.Col2)

# Create the new column by indexing the score array with the indices
df['Col3'] = score[indices]

# Restore the original row order
df = df.sort_index()

df

    Col1    Col2  Col3
0    a       5      1
1    b       10     1
2    c       15     2
3    d       20     2
4    e       25     3
5    f       30     3
6    g       1      1
7    h       11     2
8    i       21     3
9    j       7      1
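
np.searchsorted only requires its first argument (the bin edges) to be sorted; the values being looked up can be in any order. The sketch below works under that assumption and skips the sort_values/sort_index round trip. The name `right_edges` and the out-of-range guard are additions for illustration and are not part of the answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question, left in its original order.
df = pd.DataFrame({
    "Col1": list("abcdefghij"),
    "Col2": [5, 10, 15, 20, 25, 30, 1, 11, 21, 7],
})

right_edges = np.array([10, 20, 30])  # must be sorted ascending
score = np.array([1, 2, 3])

# side="left" (the default) returns the first index i with
# right_edges[i] >= value, so 10 -> 0, 11 -> 1, 30 -> 2.
indices = np.searchsorted(right_edges, df["Col2"].to_numpy(), side="left")

# A value above the last edge would give index 3 and overflow `score`,
# so fail with a clear message instead of an opaque IndexError.
if (indices >= len(score)).any():
    raise ValueError("Col2 contains values above the last bin edge")

df["Col3"] = score[indices]
print(df)
```

Note that this approach has no lower bound: anything at or below 10, including negative numbers, gets score 1.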

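Answer 2 (score: 1)

You can do this neatly with np.select(). I added some <= comparisons because I assume you want every value to be updated. If that's not what you need, it's an easy edit.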

conditions = [(df['Col2'] > 0) & (df['Col2'] <= 10),
               (df['Col2'] > 10) & (df['Col2'] <= 20),
               (df['Col2'] > 20) & (df['Col2'] <= 30) ]

updates = [1, 2, 3]

df["Col3"] = np.select(conditions, updates, default=999)
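
Since the question mentions an error about "between", here is a sketch of the same idea written with `Series.between`, which is a method on a whole pandas Series. Calling it on a single scalar inside a lambda is one plausible cause of that error, though the original code is not shown. Both ends are inclusive by default, so the ranges overlap at 10 and 20; np.select resolves this by using the first condition that matches. The default of 0 for unmatched rows is an arbitrary choice for this example.

```python
import numpy as np
import pandas as pd

# Col2 values copied from the question.
df = pd.DataFrame({"Col2": [5, 10, 15, 20, 25, 30, 1, 11, 21, 7]})

col2 = df["Col2"]

# Each condition is a boolean Series computed for the whole column at once.
conditions = [
    col2.between(0, 10),   # 0 <= x <= 10
    col2.between(10, 20),  # 10 <= x <= 20, but 10 is already claimed above
    col2.between(20, 30),  # 20 <= x <= 30, but 20 is already claimed above
]

# np.select picks, per row, the choice of the first True condition.
df["Col3"] = np.select(conditions, [1, 2, 3], default=0)
print(df)
```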

Using the original ranges gives the result below, where the values equal to 10, 20 and 30 fall through to the default of 999 from np.select().

conditions = [(df['Col2'] > 0) & (df['Col2'] < 10),
               (df['Col2'] > 10.01) & (df['Col2'] < 20),
               (df['Col2'] > 20.1) & (df['Col2'] < 30) ]

updates = [1, 2, 3]

df["Col3"] = np.select(conditions, updates, default=999)

print(df)

    Col1    Col2    Col3
0   a   5   1
1   b   10  999
2   c   15  2
3   d   20  999
4   e   25  3
5   f   30  999
6   g   1   1
7   h   11  2
8   i   21  3
9   j   7   1
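
For anyone who wants to compare the approaches in this thread, the following is a rough timing harness, not a rigorous benchmark. The test data, its size, and the function names are all hypothetical choices made for this sketch; results will vary by machine and library version, so no timings are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical test data: integers in 1..30 (the upper bound is exclusive).
big = pd.DataFrame({"Col2": rng.integers(1, 31, size=200_000)})


def with_cut(s):
    return pd.cut(s, bins=[0, 10, 20, 30], labels=[1, 2, 3]).astype(int).to_numpy()


def with_searchsorted(s):
    return np.array([1, 2, 3])[np.searchsorted([10, 20, 30], s.to_numpy())]


def with_select(s):
    conditions = [(s > 0) & (s <= 10), (s > 10) & (s <= 20), (s > 20) & (s <= 30)]
    return np.select(conditions, [1, 2, 3], default=999)


funcs = (with_cut, with_searchsorted, with_select)

# First confirm that all three approaches agree on this data.
results = {f.__name__: f(big["Col2"]) for f in funcs}

# Then time each one: best of three single runs, in seconds.
timings = {}
for func in funcs:
    timings[func.__name__] = min(
        timeit.repeat(lambda: func(big["Col2"]), number=1, repeat=3)
    )
    print(f"{func.__name__}: {timings[func.__name__]:.4f} s")
```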