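I'm currently using Pandas and NumPy. I have a DataFrame called "df". Say I have the data below: how can I assign a value to a third column based on a "between" clause? If possible, I'd like to do this with a vectorised approach to keep the existing speed.
I tried a lambda function, but frankly I didn't understand what I was doing, and I ran into errors such as an object having no attribute 'between'.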
Conventional approach, written in a non-vectorised way (pseudocode):
NOTE: I am looking for a way to make this vectorised.
If df['Col2'] is between 0 and 10
    df['Col3'] = 1
Else if df['Col2'] is between 10.01 and 20
    df['Col3'] = 2
Else if df['Col2'] is between 20.1 and 30
    df['Col3'] = 3
Sample data
+------+------+------+
| Col1 | Col2 | Col3 |
+------+------+------+
| a | 5 | 1 |
| b | 10 | 1 |
| c | 15 | 2 |
| d | 20 | 2 |
| e | 25 | 3 |
| f | 30 | 3 |
| g | 1 | 1 |
| h | 11 | 2 |
| i | 21 | 3 |
| j | 7 | 1 |
+------+------+------+
Many thanks.
Answer 0 (score: 5)
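# Row-wise (non-vectorised) version: apply a Python function to each row.
def cust_func(row):
    r = row['Col2']
    val = None  # stays None if r falls outside every range
    if r >= 0 and r <= 10:
        val = 1
    elif r >= 10.01 and r <= 20:
        val = 2
    elif r >= 20.01 and r <= 30:
        val = 3
    return val

df['Col3'] = df.apply(cust_func, axis=1)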
# Vectorised version: bin Col2 into (0, 10], (10, 20], (20, 30] with pd.cut.
import pandas as pd
cut_labels = [1, 2, 3]
cut_bins = [0, 10, 20, 30]
df['Col3'] = pd.cut(df['Col2'], bins=cut_bins, labels=cut_labels)
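To see how `pd.cut` treats the range edges, here is a minimal sketch that rebuilds the sample data from the question. With the default `right=True` each bin is closed on the right. The `include_lowest=True` argument and the final cast to `int` are additions for illustration, not part of the answer as posted.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": list("abcdefghij"),
    "Col2": [5, 10, 15, 20, 25, 30, 1, 11, 21, 7],
})

# Bins are (0, 10], (10, 20], (20, 30]; include_lowest=True also admits 0.
binned = pd.cut(df["Col2"], bins=[0, 10, 20, 30], labels=[1, 2, 3],
                include_lowest=True)

# pd.cut returns a categorical; convert to int if a numeric column is wanted.
# This cast assumes every value fell inside a bin (no NaN results).
df["Col3"] = binned.astype(int)
print(df["Col3"].tolist())
```

Values outside all bins would come back as NaN, in which case the integer cast would need to be skipped or handled separately.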
Answer 1 (score: 2)
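There are two ways: numpy.select and numpy.searchsorted. I prefer the latter because I don't have to list the conditions; as long as the array of range boundaries is sorted, it finds each position by bisection. And yes, I'd like to think it is the fastest.
It would be cool if you ran some timings and shared the results: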
Col1 Col2
0 a 5
1 b 10
2 c 15
3 d 20
4 e 25
5 f 30
6 g 1
7 h 11
8 i 21
9 j 7
import numpy as np

# Step 1: create your 'conditions'.
# Sort the DataFrame on Col2 (optional: np.searchsorted only needs `benchmarks` sorted).
df = df.sort_values('Col2')
# benchmarks are the upper bounds of the ranges used to assign scores/grades
benchmarks = np.array([10, 20, 30])
# the grades to be assigned for Col2
score = np.array([1, 2, 3])
# Use searchsorted: it returns the index at which each value would be inserted,
# e.g. for [1, 4, 5] the position of 3 is 1, since it lies between 1 and 4
# and Python uses zero-based indexing.
indices = np.searchsorted(benchmarks, df.Col2)
# create your new column by indexing the score array with the indices
df['Col3'] = score[indices]
# restore the original row order
df = df.sort_index()
df
Col1 Col2 Col3
0 a 5 1
1 b 10 1
2 c 15 2
3 d 20 2
4 e 25 3
5 f 30 3
6 g 1 1
7 h 11 2
8 i 21 3
9 j 7 1
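For anyone who wants to run the timings mentioned in this answer, the following is a rough sketch rather than a definitive benchmark. The frame `big`, its size, and the helper names `with_searchsorted`, `with_select` and `with_cut` are made up for illustration. It first checks that the three approaches agree and then times each with `timeit`. No timings are quoted here, since they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame for timing: one million values between 0 and 30.
rng = np.random.default_rng(0)
big = pd.DataFrame({"Col2": rng.uniform(0.001, 30, size=1_000_000)})

benchmarks = np.array([10, 20, 30])   # upper edge of each range, sorted
score = np.array([1, 2, 3])


def with_searchsorted(frame):
    # Only `benchmarks` has to be sorted; the frame is used in its own order.
    return score[np.searchsorted(benchmarks, frame["Col2"].to_numpy())]


def with_select(frame):
    col = frame["Col2"]
    conditions = [col <= 10, col <= 20, col <= 30]  # first match wins
    return np.select(conditions, [1, 2, 3], default=999)


def with_cut(frame):
    return pd.cut(frame["Col2"], bins=[0, 10, 20, 30], labels=[1, 2, 3])


# All three should agree before their speed is compared.
assert (with_searchsorted(big) == with_select(big)).all()
assert (with_searchsorted(big) == with_cut(big).astype(int).to_numpy()).all()

for fn in (with_searchsorted, with_select, with_cut):
    seconds = timeit.timeit(lambda: fn(big), number=5)
    print(f"{fn.__name__}: {seconds / 5:.4f} s per call")
```

Note that a value above the last benchmark would make `np.searchsorted` return an index past the end of `score`, so real data may need an extra catch-all entry.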
Answer 2 (score: 1)
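You can do this neatly with np.select(). I added some <= comparisons because I assume you want every value to be updated. If that is not what you need, it is a simple edit.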
conditions = [(df['Col2'] > 0) & (df['Col2'] <= 10),
(df['Col2'] > 10) & (df['Col2'] <= 20),
(df['Col2'] > 20) & (df['Col2'] <= 30) ]
updates = [1, 2, 3]
df["Col3"] = np.select(conditions, updates, default=999)
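Using the original ranges with strict inequalities leads to the result below, where the values equal to 10, 20 and 30 match no condition and get the default value 999 from np.select().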
conditions = [(df['Col2'] > 0) & (df['Col2'] < 10),
(df['Col2'] > 10.01) & (df['Col2'] < 20),
(df['Col2'] > 20.1) & (df['Col2'] < 30) ]
updates = [1, 2, 3]
df["Col3"] = np.select(conditions, updates, default=999)
print(df)
Col1 Col2 Col3
0 a 5 1
1 b 10 999
2 c 15 2
3 d 20 999
4 e 25 3
5 f 30 999
6 g 1 1
7 h 11 2
8 i 21 3
9 j 7 1
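Since the question mentions an error about 'between', here is a small sketch showing that `between` is a method of a pandas Series (not of a single scalar value) and that it can feed `np.select` directly. It assumes the sample data from the question and uses the default behaviour of `Series.between`, which includes both endpoints.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": list("abcdefghij"),
    "Col2": [5, 10, 15, 20, 25, 30, 1, 11, 21, 7],
})

col = df["Col2"]
# Series.between includes both endpoints by default.
conditions = [col.between(0, 10), col.between(10, 20), col.between(20, 30)]

# np.select takes the first condition that is True, so a value of exactly 10
# or 20 matches two ranges and is assigned to the lower one.
df["Col3"] = np.select(conditions, [1, 2, 3], default=999)
print(df["Col3"].tolist())
```

Because overlapping endpoints are resolved by the order of the conditions, no value in the 0 to 30 range falls through to the 999 default.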