I have a DataFrame like this:
df
col1 col2 col3 col4
1 2 P Q
4 2 R S
5 3 P R
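I want to create a function that takes col3 and col4 values as input and returns the matching col1 and col2 values.
For example, if the function is f, the output of f([P,Q]) would look like: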
col1 col2
1 2
How can I do this with pandas in the most efficient way?
Answer 0 (score: 3)
If you need the most efficient approach, compare NumPy arrays:
def f(a, b):
    # pandas 0.24+
    mask = (df['col3'].to_numpy() == a) & (df['col4'].to_numpy() == b)
    # works in all pandas versions
    # mask = (df['col3'].values == a) & (df['col4'].values == b)
    return df.loc[mask, ['col1', 'col2']]
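As a quick check against the sample data from the question, here is a minimal self-contained sketch. The name `lookup` and the choice to pass the frame in as a parameter (rather than rely on a global `df`) are assumptions made for illustration, not part of the answer.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "col1": [1, 4, 5],
    "col2": [2, 2, 3],
    "col3": ["P", "R", "P"],
    "col4": ["Q", "S", "R"],
})


def lookup(frame, a, b):
    """Return col1/col2 for rows where col3 == a and col4 == b (hypothetical helper)."""
    # Compare the underlying NumPy arrays instead of the pandas Series
    mask = (frame["col3"].to_numpy() == a) & (frame["col4"].to_numpy() == b)
    return frame.loc[mask, ["col1", "col2"]]


result = lookup(df, "P", "Q")
print(result)
```

For the sample frame this selects only the first row (col1 = 1, col2 = 2), which matches the output the question asks for.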
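Performance: it depends on the data, the number of rows, and the number of matching rows, but comparing 1-D NumPy arrays is generally faster here: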
import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
L = list('PQRSTU')
df = pd.DataFrame({'col1': np.random.randint(10, size=N),
                   'col2': np.random.randint(10, size=N),
                   'col3': np.random.choice(L, N),
                   'col4': np.random.choice(L, N)})
print(df)

def f(a, b):
    # pandas 0.24+
    mask = (df['col3'].to_numpy() == a) & (df['col4'].to_numpy() == b)
    # works in all pandas versions
    # mask = (df['col3'].values == a) & (df['col4'].values == b)
    return df.loc[mask, ['col1', 'col2']]

def f1(first, second):
    return df.loc[(df['col3'] == first) & (df['col4'] == second), ['col1', 'col2']]
In [91]: %timeit (f('P', 'Q'))
2.05 ms ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [92]: %timeit (f1('P', 'Q'))
3.52 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
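The `%timeit` lines above only work in IPython or Jupyter. A rough equivalent with the standard-library `timeit` module might look like the sketch below. The variable names `t_numpy` and `t_series` are made up for the example, and the numbers it prints will differ by machine and pandas version, so treat them as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

# Same synthetic data as in the benchmark above
np.random.seed(123)
N = 10000
L = list("PQRSTU")
df = pd.DataFrame({
    "col1": np.random.randint(10, size=N),
    "col2": np.random.randint(10, size=N),
    "col3": np.random.choice(L, N),
    "col4": np.random.choice(L, N),
})


def f(a, b):
    # Build the mask from raw NumPy arrays
    mask = (df["col3"].to_numpy() == a) & (df["col4"].to_numpy() == b)
    return df.loc[mask, ["col1", "col2"]]


def f1(first, second):
    # Build the mask from pandas Series
    return df.loc[(df["col3"] == first) & (df["col4"] == second), ["col1", "col2"]]


# Sanity check: both versions must select exactly the same rows
assert f("P", "Q").equals(f1("P", "Q"))

# Time 100 calls of each and report the mean time per call
t_numpy = timeit.timeit(lambda: f("P", "Q"), number=100)
t_series = timeit.timeit(lambda: f1("P", "Q"), number=100)
print(f"NumPy mask:  {t_numpy / 100 * 1e3:.3f} ms per call")
print(f"Series mask: {t_series / 100 * 1e3:.3f} ms per call")
```

Checking that both functions return the same frame before timing them guards against comparing the speed of two functions that do different things.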
Answer 1 (score: 3)
Just use a boolean mask:
def f(first, second):
    return df.loc[(df['col3'] == first) & (df['col4'] == second), ['col1', 'col2']]
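To make the boolean mask less opaque, the sketch below splits the one-liner into its intermediate steps on the sample data from the question. The names `is_p`, `is_q`, `mask`, and `selected` are introduced here purely for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "col1": [1, 4, 5],
    "col2": [2, 2, 3],
    "col3": ["P", "R", "P"],
    "col4": ["Q", "S", "R"],
})

# Each comparison yields a boolean Series with one entry per row
is_p = df["col3"] == "P"   # True, False, True
is_q = df["col4"] == "Q"   # True, False, False

# & combines the two Series element-wise. When the comparisons are written
# inline, each one needs parentheses because & binds tighter than ==.
mask = is_p & is_q         # True, False, False

# .loc takes the row mask first, then the list of columns to keep
selected = df.loc[mask, ["col1", "col2"]]
print(selected)
```

Only rows where both conditions hold survive, so for the sample data `selected` contains the single row with col1 = 1 and col2 = 2.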
Answer 2 (score: 2)
**A single line of code can do this**
Replace 'P' and 'Q' with the values you want to match.
df[(df.col3 == 'P') & (df.col4 == 'Q')][['col1', 'col2']]
Answer 3 (score: 0)
You can try the following code:
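def func(x):
    # Look up the rows matching this row's col3/col4 values,
    # using f as defined in the answers above
    series = f(x['col3'], x['col4'])
    # Note: .append() was removed in pandas 2.0; this line needs pandas < 2.0
    return series.append(x)

# axis=1 passes each row (not each column) to func
dataframe = dataframe.apply(func, axis=1)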