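Imagine I have a DataFrame with columns [A, B, C]. Each of these columns holds a number of different values. I want to produce one more column, D, which can be computed with the following function: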
import pandas as pd

def produce_column(i):
    # Extract the current row by index label
    raw = df.loc[i]
    # Take the last 3 rows of the same (A, B) sub-frame, up to and including i
    # (.loc[:i] is a label slice, so the row i itself is part of the window)
    df_same = df[
        (df['A'] == raw.A)
        & (df['B'] == raw.B)
    ].loc[:i].tail(3)
    # Check that we have enough values
    if df_same.shape[0] != 3:
        return False
    # The exact function doesn't matter; I just need to apply it to the column(s)
    diffs = df_same['C'].map(lambda x: 0 < x <= 10)
    return all(diffs)

df['D'] = df.index.map(produce_column)
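So at every step I need to take the sub-DataFrame whose rows share the same A and B values as the current row, and run some operation on a column of that sub-DataFrame. I have a few hundred thousand rows, so this code takes a long time to run. I think vectorizing the operation would be a good idea, but I don't know how to do it. Maybe there is another way to do this?
Thanks!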
UPD: here is an example
df = pd.DataFrame([(1,2,3), (4,5,6), (7,8,9)], columns=['A','B','C'])
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
df['D'] = df.index.map(lambda x: produce_column(x))
   A  B  C      D
0  1  2  3   True
1  4  5  6   True
2  7  8  9  False
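
One possible direction for the vectorization asked about above is to do the per-row check once for the whole column and then count the passing rows in a trailing window inside each (A, B) group. This is only a minimal sketch, not a drop-in replacement: the helper name rolling_flag and the demo frame are made up for illustration, and it assumes the index is unique and sorted (for example the default RangeIndex), so that the "last 3 rows up to i" of the original function match row order within a group. It also only covers checks that can be written as a per-row condition followed by "all of the last 3".

```python
import pandas as pd


def rolling_flag(df: pd.DataFrame, window: int = 3) -> pd.Series:
    """Hypothetical helper: True where the last `window` rows of the same
    (A, B) group, current row included, all satisfy 0 < C <= 10."""
    # Per-row check, evaluated once for the whole column.
    in_range = df["C"].gt(0) & df["C"].le(10)
    # Count passing rows in a trailing window inside each (A, B) group.
    # rolling() yields NaN until a group has seen `window` rows.
    hits = (
        in_range.astype(int)
        .groupby([df["A"], df["B"]])
        .transform(lambda s: s.rolling(window).sum())
    )
    # NaN never equals `window`, so groups that are too short come out False.
    return hits.eq(window)


# Illustrative data (not from the question): one group with five rows,
# one group with only two rows.
demo = pd.DataFrame(
    {
        "A": [1, 1, 1, 1, 2, 2, 1],
        "B": [2, 2, 2, 2, 3, 3, 2],
        "C": [3, 6, 9, 12, 1, 2, 5],
    }
)
demo["D"] = rolling_flag(demo)
print(demo)
```

The lambda inside transform still runs once per group rather than once per row, so the cost depends on the number of distinct (A, B) pairs; whether that is fast enough for a few hundred thousand rows would need to be measured on the real data.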