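Suppose I have the following DataFrame. I want to return a value for column B based on several conditions on column A. The rules are: if the value in A is >= 0 and < 50, return the original value of B; if the value in A is >= 50 and < 70, return B divided by 3; if the value in A is >= 70 and < 100, return B divided by C and then by 3.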
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 3)), columns=list('ABC'))
My pseudocode in Python:
def Standard():
    if (df['A'] >= 0) and (df['A'] < 50):
        return df['B']
    if (df['A'] >= 50) and (df['A'] < 70):
        return df['B']/3
    if (df['A'] >= 70) and (df['A'] <= 100):
        return df['B']/df['C']/3

df['B'] = df.apply(Standard, axis = 1)
It returns: TypeError: ('Standard() takes 0 positional arguments but 1 was given', 'occurred at index 0')
How can I correct my code, or is there a better way to do this in Python? Thanks for your help.
Answer 0: (score: 3)
For better performance, use numpy.select instead of apply. You can also set a default value for rows that match none of the conditions:
masks = [(df['A'] >= 0) & (df['A'] < 50),
         (df['A'] >= 50) & (df['A'] < 70),
         (df['A'] >= 70) & (df['A'] <= 100)]
vals = [df['B'], df['B'] / 3, df['B']/df['C']/3]

df['B'] = np.select(masks, vals, default=0)
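To see how np.select behaves, here is a minimal sketch that applies the same masks to a small hand-made frame. The five rows are illustrative values, not data from the question. np.select takes the value from the first condition that is true for each row and falls back to default when none matches (here, the row with A = -5).

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, chosen so the results are easy to check by hand).
df = pd.DataFrame({
    "A": [10, 60, 80, 100, -5],
    "B": [30, 30, 30, 30, 30],
    "C": [5, 5, 5, 5, 5],
})

# One boolean mask per rule; element-wise & replaces Python's `and`.
masks = [
    (df["A"] >= 0) & (df["A"] < 50),
    (df["A"] >= 50) & (df["A"] < 70),
    (df["A"] >= 70) & (df["A"] <= 100),
]

# One candidate column per mask, in the same order.
vals = [df["B"], df["B"] / 3, df["B"] / df["C"] / 3]

# Rows matching no mask (A = -5 here) receive the default.
result = np.select(masks, vals, default=0)
print(result)
```

Note that every entry in vals is computed for all rows before np.select picks among them, so the division by C is evaluated even for rows where that rule does not apply.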
Performance: roughly 900 times faster on 10,000 rows (424 ms versus 468 µs):
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(10000, 3)), columns=list('ABC'))
#Jeril solution
In [74]: %timeit df['B1'] = df.apply(Standard, axis=1)
__main__:18: RuntimeWarning: divide by zero encountered in double_scalars
424 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [75]: %timeit df['B'] = np.select(masks, vals, default=0)
468 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
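The RuntimeWarning in the timing output comes from rows where C is 0. One possible guard, which is not part of the original answer and is only a sketch, is to mask zeros in C before dividing so those rows become NaN instead of inf. The two-row frame is illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): the second row has C == 0.
df = pd.DataFrame({"A": [80, 90], "B": [30, 30], "C": [5, 0]})

# Series.where keeps values where the condition holds and inserts NaN elsewhere.
safe_c = df["C"].where(df["C"] != 0)

# Dividing by NaN yields NaN, so no division by zero takes place.
ratio = df["B"] / safe_c / 3
print(ratio)
```

Whether NaN, 0, or some other value is the right result for such rows depends on the data, so treat the choice here as a placeholder.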
Answer 1: (score: 2)
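Could you try the following? With axis=1, apply passes each row to the function as an argument, so Standard has to accept one parameter: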
def Standard(row):
    if (row['A'] >= 0) and (row['A'] < 50):
        return row['B']
    if (row['A'] >= 50) and (row['A'] < 70):
        return row['B']/3
    if (row['A'] >= 70) and (row['A'] <= 100):
        return row['B']/row['C']/3

df['B'] = df.apply(Standard, axis=1)
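As a sanity check, the sketch below compares a row-wise function like the one above with the numpy.select approach on random data. Two things are assumptions made for the check, not part of either answer: column C is drawn from 1 to 99 so that no division by zero occurs, and the function is named standard with an explicit fallback of 0 for rows matching no rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "A": rng.integers(0, 100, size=1000),
    "B": rng.integers(0, 100, size=1000),
    "C": rng.integers(1, 100, size=1000),  # assumption: zeros excluded
})

def standard(row):
    """Row-wise rule; `row` is one DataFrame row passed in by apply(axis=1)."""
    if 0 <= row["A"] < 50:
        return row["B"]
    if 50 <= row["A"] < 70:
        return row["B"] / 3
    if 70 <= row["A"] <= 100:
        return row["B"] / row["C"] / 3
    return 0  # fallback, mirrors default=0 in np.select

row_wise = df.apply(standard, axis=1)

masks = [
    (df["A"] >= 0) & (df["A"] < 50),
    (df["A"] >= 50) & (df["A"] < 70),
    (df["A"] >= 70) & (df["A"] <= 100),
]
vals = [df["B"], df["B"] / 3, df["B"] / df["C"] / 3]
vectorized = np.select(masks, vals, default=0)

# Both approaches should agree on every row.
same = np.allclose(row_wise.to_numpy(dtype=float), vectorized)
print(same)
```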