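I have a DataFrame with 15K rows. Wherever the value 3 repeats consecutively in column 'val1', I want to set every second occurrence to zero. If a 3 in 'val1' is not repeated, it should stay 3. I can do this by iterating over the DataFrame, but that is slow.
I have something like this: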
import pandas as pd
dates = pd.date_range('2008-10-01', periods=15, freq='D')
df = pd.DataFrame({'val1': (0,0,3,3,3,3,3,0,0,3,0,3,3,3,0) },index=dates)
print(df)
val1
2008-10-01 0
2008-10-02 0
2008-10-03 3
2008-10-04 3
2008-10-05 3
2008-10-06 3
2008-10-07 3
2008-10-08 0
2008-10-09 0
2008-10-10 3
2008-10-11 0
2008-10-12 3
2008-10-13 3
2008-10-14 3
2008-10-15 0
What I want to achieve is this:
df = pd.DataFrame({ 'val1': (0,0,3,3,3,3,3,0,0,3,0,3,3,3,0),'val2': (0,0,3,0,3,0,3,0,0,3,0,3,0,3,0)},index=dates )
print(df)
val1 val2
2008-10-01 0 0
2008-10-02 0 0
2008-10-03 3 3
2008-10-04 3 0
2008-10-05 3 3
2008-10-06 3 0
2008-10-07 3 3
2008-10-08 0 0
2008-10-09 0 0
2008-10-10 3 3
2008-10-11 0 0
2008-10-12 3 3
2008-10-13 3 0
2008-10-14 3 3
2008-10-15 0 0
The only working solution I have found is to iterate over the rows, which is far too slow:
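df['val3'] = 0
val3_pos = df.columns.get_loc('val3')  # column position for .iloc assignment
for i in range(len(df.index)):
    # a 3 preceded by a 3, where the row two steps back was not kept in val2
    if df['val1'].iloc[i] == 3 and df['val1'].iloc[i-1] == 3 and df['val2'].iloc[i-2] != 3:
        df.iloc[i-1, val3_pos] = 3
    # a 0 preceded by a 3: the previous row is the last 3 of its run
    if df['val1'].iloc[i] == 0 and df['val1'].iloc[i-1] == 3:
        df.iloc[i-1, val3_pos] = 3
print(df)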
val1 val2 val3
2008-10-01 0 0 0
2008-10-02 0 0 0
2008-10-03 3 3 3
2008-10-04 3 0 0
2008-10-05 3 3 3
2008-10-06 3 0 0
2008-10-07 3 3 3
2008-10-08 0 0 0
2008-10-09 0 0 0
2008-10-10 3 3 3
2008-10-11 0 0 0
2008-10-12 3 3 3
2008-10-13 3 0 0
2008-10-14 3 3 3
2008-10-15 0 0 0
Any suggestions on how to achieve this without iteration, or how to make the iteration faster?
Answer 0 (score: 1)
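First, we create an indicator that labels each group of consecutive identical values (in this case, the runs of 3). Then we group by that indicator and use range(step=2) to pick out every 2nd index within each group. Finally, we use .loc to select those indices and assign 0: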
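# label each run of consecutive equal values with its own group id
grps = df['val1'].diff().ne(0).cumsum()
# collect the index labels of every 2nd row (positions 1, 3, 5, ...) within each run
idx = df.groupby(grps).apply(lambda g: g.iloc[list(range(1, len(g), 2))]).index.get_level_values(1)
# overwrite those rows with 0 (note: this modifies 'val1' in place)
df.loc[idx, 'val1'] = 0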
Output
val1
2008-10-01 0
2008-10-02 0
2008-10-03 3
2008-10-04 0
2008-10-05 3
2008-10-06 0
2008-10-07 3
2008-10-08 0
2008-10-09 0
2008-10-10 3
2008-10-11 0
2008-10-12 3
2008-10-13 0
2008-10-14 3
2008-10-15 0
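To make the grouping step more concrete, here is a minimal sketch that prints the run labels the diff/cumsum indicator produces for the sample data. The name `run_id` is only illustrative and does not appear in the answer.

```python
import pandas as pd

dates = pd.date_range('2008-10-01', periods=15, freq='D')
val1 = pd.Series((0,0,3,3,3,3,3,0,0,3,0,3,3,3,0), index=dates, name='val1')

# diff() is NaN on the first row and non-zero wherever the value changes,
# so ne(0) flags the start of every run and cumsum() numbers the runs.
run_id = val1.diff().ne(0).cumsum()
print(run_id.tolist())
# [1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6, 7]
```

Rows that share a label belong to the same run, so the five consecutive 3s starting on 2008-10-03 all get label 2.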
Answer 1 (score: 1)
Use:
dates = pd.date_range('2008-10-01', periods=15, freq='D')
df = pd.DataFrame({'val1': (0,0,3,3,3,3,3,0,0,3,0,3,3,3,0) },index=dates)
#create consecutive groups
g = df['val1'].ne(df['val1'].shift()).cumsum()
#create counter per groups with modulo 2 and compare by 0
m = df.groupby(g).cumcount() % 2 == 0
#alternative, thanks @Erfan
#m = df.groupby(g).cumcount().mod(2).eq(0)
#set new column
df['val2'] = df['val1'].where(m, 0)
val1 val2
2008-10-01 0 0
2008-10-02 0 0
2008-10-03 3 3
2008-10-04 3 0
2008-10-05 3 3
2008-10-06 3 0
2008-10-07 3 3
2008-10-08 0 0
2008-10-09 0 0
2008-10-10 3 3
2008-10-11 0 0
2008-10-12 3 3
2008-10-13 3 0
2008-10-14 3 3
2008-10-15 0 0
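If the same logic is needed in several places, the steps above could be wrapped in a small helper. This is only a sketch: the function name `zero_every_second` and its `value` parameter are hypothetical and not part of either answer. Passing `value=3` restricts the thinning to runs of 3, which is what the question asks for; other values are left untouched.

```python
import pandas as pd

def zero_every_second(s, value=None):
    """Keep the 1st, 3rd, 5th, ... element of each run of equal values; zero the rest.

    Hypothetical helper for illustration. If `value` is given,
    only runs of that value are thinned.
    """
    run_id = s.ne(s.shift()).cumsum()      # one label per run of equal values
    pos = s.groupby(run_id).cumcount()     # 0, 1, 2, ... inside each run
    keep = pos % 2 == 0                    # even positions survive
    if value is not None:
        keep |= s.ne(value)                # rows with other values are always kept
    return s.where(keep, 0)

dates = pd.date_range('2008-10-01', periods=15, freq='D')
df = pd.DataFrame({'val1': (0,0,3,3,3,3,3,0,0,3,0,3,3,3,0)}, index=dates)
df['val2'] = zero_every_second(df['val1'], value=3)
print(df['val2'].tolist())
# [0, 0, 3, 0, 3, 0, 3, 0, 0, 3, 0, 3, 0, 3, 0]
```

Because every step is a vectorized pandas operation, this avoids the Python-level row loop from the question.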