Get the n rows before a specific value in pandas

Time: 2020-05-25 09:13:33

Tags: python pandas for-loop indexing

Say I have the following dataframe:

import pandas as pd
# Named `data` rather than `dict` to avoid shadowing the built-in type
data = {'val': [3.2, 2.4, -2.3, -4.9, 3.2, 2.4, -2.3, -4.9, 2.4, -2.3, -4.9],
        'label': [0, 2, 1, -1, 1, 2, -1, -1, 1, 1, -1]}
df = pd.DataFrame(data)
df
     val    label
0    3.2     0
1    2.4     2
2   -2.3     1
3   -4.9    -1
4    3.2     1
5    2.4     2
6   -2.3    -1
7   -4.9    -1
8    2.4     1
9   -2.3     1
10  -4.9    -1

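I want to get the n (for example 2) rows before every -1 value in the column label. In the given df, the first -1 appears at index 3, so we keep the two rows before it and drop index 3 itself; the next -1 appears at index 6, so we again keep the two rows before it, and so on. The desired output is as follows: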

    val     label
1    2.4     2
2   -2.3     1
4    3.2     1
5    2.4     2
6   -2.3    -1
8    2.4     1
9   -2.3     1

Thanks for any ideas!

3 answers:

Answer 0: (score: 2)

The approach built on index methods is as concise as it gets and performs better, so it is preferable:

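# Index labels of the rows where label == -1 (assumes the default RangeIndex)
idx = df[df.label == -1].index
# The one and two rows before each marker, sorted and deduplicated by union
filtered_idx = (idx - 1).union(idx - 2)
# Drop negative positions; >= 0 keeps row 0 when the marker sits at index 1 or 2
filtered_idx = filtered_idx[filtered_idx >= 0]

df_new = df.iloc[filtered_idx]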

""" output
   val  label
1  2.4      2
2 -2.3      1
4  3.2      1
5  2.4      2
6 -2.3     -1
8  2.4      1
9 -2.3      1
"""
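
The question asks for a general n, while the snippet above hard-codes n = 2. Below is a minimal sketch of one way to generalize it, assuming NumPy broadcasting is acceptable. The function name `rows_before` and its parameters `column` and `marker` are made up for this illustration and are not part of the original answer. It works on positions rather than labels, so it does not rely on the default integer index.

```python
import numpy as np
import pandas as pd


def rows_before(df, n, column="label", marker=-1):
    """Return the rows located within n positions before any marker row."""
    # Positions (not labels) of the marker rows, so any index type works.
    pos = np.flatnonzero(df[column].to_numpy() == marker)
    # For each marker position p, build p-1, p-2, ..., p-n via broadcasting.
    candidates = (pos[:, None] - np.arange(1, n + 1)).ravel()
    # Drop positions before the start of the frame, then sort and deduplicate.
    keep = np.unique(candidates[candidates >= 0])
    return df.iloc[keep]


df = pd.DataFrame({
    "val": [3.2, 2.4, -2.3, -4.9, 3.2, 2.4, -2.3, -4.9, 2.4, -2.3, -4.9],
    "label": [0, 2, 1, -1, 1, 2, -1, -1, 1, 1, -1],
})
result = rows_before(df, 2)
print(result.index.tolist())  # [1, 2, 4, 5, 6, 8, 9]
```

With n = 2 this reproduces the desired output from the question; other values of n only change the width of the window before each -1.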

Speed comparison against the for loop solution:

# create large df:
import time
import numpy as np
df = pd.DataFrame(np.random.random((20000000,2)), columns=["val","label"])
df.loc[df.sample(frac=0.01).index, "label"] = - 1

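def vectorized_filter(df):
    idx = df[df.label == -1].index
    filtered_idx = (idx - 1).union(idx - 2)
    # Without this guard, a marker at index 0 or 1 yields negative positions,
    # which iloc would silently wrap around to the end of the frame
    filtered_idx = filtered_idx[filtered_idx >= 0]
    df_new = df.iloc[filtered_idx]
    return df_new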

def loop_filter(df):
    filter = df.loc[df['label'] == -1].index
    req_idx = []
    for idx in filter:
        if idx == 0:
            continue
        elif idx == 1:
            req_idx.append(idx-1)
        else:
            req_idx.append(idx-2)
            req_idx.append(idx-1)    
    req_idx = list(set(req_idx))
    df2 = df.loc[df.index.isin(req_idx)]
    return df2


start = time.time()
q = vectorized_filter(df)
t1 = time.time() - start

start = time.time()
q2 = loop_filter(df)
t2 = time.time() - start

t2/t1 # ~20 on my machine
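
A single `time.time()` measurement can be noisy. As a hedged alternative, the sketch below times the vectorized function with the standard-library `timeit.repeat` and reports the best run. The frame size of 200,000 rows, the seed, and the repeat counts are arbitrary choices for illustration, not values from the original answer.

```python
import timeit

import numpy as np
import pandas as pd


def vectorized_filter(df):
    idx = df[df.label == -1].index
    filtered_idx = (idx - 1).union(idx - 2)
    # Guard against negative positions wrapping around in iloc.
    filtered_idx = filtered_idx[filtered_idx >= 0]
    return df.iloc[filtered_idx]


# Smaller frame than in the answer so the sketch runs quickly (assumed size).
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.random((200_000, 2)), columns=["val", "label"])
small.loc[small.sample(frac=0.01, random_state=0).index, "label"] = -1

# Three repetitions of five calls each; the minimum is the least disturbed run.
calls_per_run = 5
timings = timeit.repeat(lambda: vectorized_filter(small),
                        number=calls_per_run, repeat=3)
best = min(timings) / calls_per_run
print(f"best time per call: {best:.6f} s")
```

The same pattern can be applied to the loop version to obtain a ratio comparable to `t2/t1`.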

Answer 1: (score: 0)

Here is a solution:

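markers = df[df.label.eq(-1)].index
# DataFrame.append was removed in pandas 2.0, so collect the slices and concatenate once.
# max(..., 0) keeps a marker in the first two rows from producing an empty slice.
pieces = [df.iloc[max(marker - 2, 0):marker] for marker in markers]
new_df = pd.concat(pieces)

# Overlapping slices repeat rows; deduplicate on the original index.
new_df = new_df.reset_index().drop_duplicates().set_index("index")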

Result:

       val  label
index            
1      2.4      2
2     -2.3      1
4      3.2      1
5      2.4      2
6     -2.3     -1
8      2.4      1
9     -2.3      1

Answer 2: (score: 0)

filter = df.loc[df['label'] == -1].index

req_idx = []
for idx in filter:
    if idx == 0:
        continue
    elif idx == 1:
        req_idx.append(idx-1)
    else:
        req_idx.append(idx-2)
        req_idx.append(idx-1)

req_idx = list(set(req_idx))


df2 = df.loc[df.index.isin(req_idx)]

print(df2)

Output:

   val  label
1  2.4      2
2 -2.3      1
4  3.2      1
5  2.4      2
6 -2.3     -1
8  2.4      1
9 -2.3      1

This should also work if the label is -1 in one of the first two rows.
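
The edge case mentioned here can be checked directly. The sketch below wraps the loop from this answer in a function and runs it on a small frame whose first two labels are both -1. The frame `edge` and the local name `marker_idx` are invented for this illustration.

```python
import pandas as pd


def loop_filter(df):
    # Same logic as the loop above, collected in a set to avoid duplicates.
    marker_idx = df.loc[df["label"] == -1].index
    req_idx = set()
    for idx in marker_idx:
        if idx == 0:
            continue                   # nothing precedes the first row
        elif idx == 1:
            req_idx.add(idx - 1)       # only one row precedes index 1
        else:
            req_idx.update((idx - 2, idx - 1))
    return df.loc[df.index.isin(list(req_idx))]


# Hypothetical test frame: markers at indices 0, 1 and 3.
edge = pd.DataFrame({"val": [1.0, 2.0, 3.0, 4.0], "label": [-1, -1, 0, -1]})
kept = loop_filter(edge).index.tolist()
print(kept)  # [0, 1, 2]
```

Row 0 is kept because it precedes the marker at index 1, and rows 1 and 2 are kept because they precede the marker at index 3.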