Pandas: set every second repeated row value to zero

Time: 2019-09-24 10:59:46

Tags: pandas

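I have a data frame with 15K rows. Wherever the value 3 repeats in consecutive rows of column 'val1', I want to set every second occurrence to zero. Where a 3 in 'val1' is not repeated, it should stay 3. I can achieve this by iterating over the data frame, but that is slow.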

I have something like this:

import pandas as pd


dates = pd.date_range('2008-10-01', periods=15, freq='D')
df = pd.DataFrame({'val1': (0,0,3,3,3,3,3,0,0,3,0,3,3,3,0) },index=dates) 
print(df)
            val1
2008-10-01     0
2008-10-02     0
2008-10-03     3
2008-10-04     3
2008-10-05     3
2008-10-06     3
2008-10-07     3
2008-10-08     0
2008-10-09     0
2008-10-10     3
2008-10-11     0
2008-10-12     3
2008-10-13     3
2008-10-14     3
2008-10-15     0

What I want to achieve is this:

df = pd.DataFrame({ 'val1': (0,0,3,3,3,3,3,0,0,3,0,3,3,3,0),'val2': (0,0,3,0,3,0,3,0,0,3,0,3,0,3,0)},index=dates ) 
print(df)

            val1  val2
2008-10-01     0     0
2008-10-02     0     0
2008-10-03     3     3
2008-10-04     3     0
2008-10-05     3     3
2008-10-06     3     0
2008-10-07     3     3
2008-10-08     0     0
2008-10-09     0     0
2008-10-10     3     3
2008-10-11     0     0
2008-10-12     3     3
2008-10-13     3     0
2008-10-14     3     3
2008-10-15     0     0

The only working solution I have found is to iterate over the rows, which is far too slow:

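df['val3'] = 0
val3_col = df.columns.get_loc('val3')
for i in range(len(df.index)):
    # Inside a run of 3s: keep the previous row unless the row before it was already kept.
    # This checks val3 (the column being built) rather than the hand-written target val2.
    if (df['val1'].iloc[i] == 3) and (df['val1'].iloc[i-1] == 3) and (df['val3'].iloc[i-2] != 3):
        df.iloc[i-1, val3_col] = 3
    # A run of 3s has just ended: keep its last row.
    if (df['val1'].iloc[i] == 0) and (df['val1'].iloc[i-1] == 3):
        df.iloc[i-1, val3_col] = 3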


            val1  val2  val3
2008-10-01     0     0     0
2008-10-02     0     0     0
2008-10-03     3     3     3
2008-10-04     3     0     0
2008-10-05     3     3     3
2008-10-06     3     0     0
2008-10-07     3     3     3
2008-10-08     0     0     0
2008-10-09     0     0     0
2008-10-10     3     3     3
2008-10-11     0     0     0
2008-10-12     3     3     3
2008-10-13     3     0     0
2008-10-14     3     3     3
2008-10-15     0     0     0

Any suggestions on how to achieve this without iteration, or how to make the iteration faster?

2 answers:

Answer 0: (score: 1)

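First we create an indicator that labels each group of consecutive equal values, in this case the runs of 3s. Then we group by that indicator and take every second index of each group with range(1, len(x), 2). Finally we locate those indices with .loc and assign 0.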

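# Label each run of consecutive equal values
grps = df['val1'].diff().ne(0).cumsum()

# Within each run, take every second row (positions 1, 3, 5, ...) and collect the index labels
idx = df.groupby(grps).apply(lambda g: g.iloc[list(range(1, len(g), 2))]).index.get_level_values(1)

# Set those rows to zero in place
df.loc[idx, 'val1'] = 0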

Output

            val1
2008-10-01     0
2008-10-02     0
2008-10-03     3
2008-10-04     0
2008-10-05     3
2008-10-06     0
2008-10-07     3
2008-10-08     0
2008-10-09     0
2008-10-10     3
2008-10-11     0
2008-10-12     3
2008-10-13     0
2008-10-14     3
2008-10-15     0
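
To make the run indicator more concrete, here is a minimal sketch on the question's sample data. The names `run_id` and `odd_pos` are illustrative and not part of the answer above, and the `cumcount` step is an alternative way to pick every second row that avoids `apply`; it should select the same rows as `idx`.

```python
import pandas as pd

dates = pd.date_range('2008-10-01', periods=15, freq='D')
df = pd.DataFrame({'val1': (0, 0, 3, 3, 3, 3, 3, 0, 0, 3, 0, 3, 3, 3, 0)}, index=dates)

# A new run starts wherever the value differs from the previous row.
# diff() is NaN on the first row, and NaN != 0, so the first row also starts a run.
run_id = df['val1'].diff().ne(0).cumsum()
print(run_id.tolist())
# [1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6, 7]

# Position of each row inside its run (0, 1, 2, ...); odd positions are "every second" row.
odd_pos = df.groupby(run_id).cumcount() % 2 == 1
idx = df.index[odd_pos]

# Assign zero to those rows, as in the answer.
df.loc[idx, 'val1'] = 0
print(df['val1'].tolist())
# [0, 0, 3, 0, 3, 0, 3, 0, 0, 3, 0, 3, 0, 3, 0]
```

Note that runs of zeros are also touched here, which is harmless because those rows are already zero.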

Answer 1: (score: 1)

Use:

dates = pd.date_range('2008-10-01', periods=15, freq='D')
df = pd.DataFrame({'val1': (0,0,3,3,3,3,3,0,0,3,0,3,3,3,0) },index=dates) 

#create consecutive groups
g = df['val1'].ne(df['val1'].shift()).cumsum()

#create counter per groups with modulo 2 and compare by 0
m = df.groupby(g).cumcount() % 2 == 0
#alternative, thanks @Erfan
#m = df.groupby(g).cumcount().mod(2).eq(0)

#set new column
df['val2'] = df['val1'].where(m, 0)
print(df)
            val1  val2
2008-10-01     0     0
2008-10-02     0     0
2008-10-03     3     3
2008-10-04     3     0
2008-10-05     3     3
2008-10-06     3     0
2008-10-07     3     3
2008-10-08     0     0
2008-10-09     0     0
2008-10-10     3     3
2008-10-11     0     0
2008-10-12     3     3
2008-10-13     3     0
2008-10-14     3     3
2008-10-15     0     0
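
If the column may contain repeated values other than 3 that should be left alone, the same idea can be restricted to one value. The following is only a sketch under that assumption; `zero_every_second` and its `value` parameter are hypothetical names, not something from the answers.

```python
import pandas as pd


def zero_every_second(s: pd.Series, value=3) -> pd.Series:
    """Hypothetical helper: in each consecutive run of `value`,
    keep the 1st, 3rd, 5th, ... occurrence and set the others to 0."""
    # Label consecutive runs of equal values.
    run_id = s.ne(s.shift()).cumsum()
    # Zero-based position of each row within its run.
    pos = s.groupby(run_id).cumcount()
    # Rows to zero: the row holds `value` and sits at an odd position in its run.
    drop = s.eq(value) & pos.mod(2).eq(1)
    return s.mask(drop, 0)


sample = pd.Series([0, 0, 3, 3, 3, 3, 3, 0, 0, 3, 0, 3, 3, 3, 0])
result = zero_every_second(sample)
print(result.tolist())
# [0, 0, 3, 0, 3, 0, 3, 0, 0, 3, 0, 3, 0, 3, 0]

# Runs of other values are left untouched.
other = zero_every_second(pd.Series([5, 5, 5, 3, 3]))
print(other.tolist())
# [5, 5, 5, 3, 0]
```

Because every step is a vectorized Series operation, this avoids the Python-level row loop from the question.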