Say I have the following DataFrame:
import pandas as pd
# named `data` rather than `dict` to avoid shadowing the built-in
data = {'val': [3.2, 2.4, -2.3, -4.9, 3.2, 2.4, -2.3, -4.9, 2.4, -2.3, -4.9],
        'label': [0, 2, 1, -1, 1, 2, -1, -1, 1, 1, -1]}
df = pd.DataFrame(data)
df
    val  label
0   3.2      0
1   2.4      2
2  -2.3      1
3  -4.9     -1
4   3.2      1
5   2.4      2
6  -2.3     -1
7  -4.9     -1
8   2.4      1
9  -2.3      1
10 -4.9     -1
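I want to keep the n (e.g. 2) rows before each -1 value in the label column. In the given df, the first -1 appears at index 3, so we keep the two rows before it and drop index 3; the next -1 appears at index 6, so we again keep the two rows before it, and so on. The desired output is: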
   val  label
1  2.4      2
2 -2.3      1
4  3.2      1
5  2.4      2
6 -2.3     -1
8  2.4      1
9 -2.3      1
Any ideas are appreciated!
Answer 0 (score: 2)
Working directly with the index is about as concise as it gets and performs better, so it is the preferable approach:
idx = df[df.label == -1].index
filtered_idx = (idx - 1).union(idx - 2)
# keep index 0 as well; assumes the default RangeIndex, so labels equal positions
filtered_idx = filtered_idx[filtered_idx >= 0]
df_new = df.iloc[filtered_idx]
""" output
   val  label
1  2.4      2
2 -2.3      1
4  3.2      1
5  2.4      2
6 -2.3     -1
8  2.4      1
9 -2.3      1
"""
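The question asks for an arbitrary n, while the snippet above hard-codes n = 2. A minimal sketch of one way to generalize it follows; the function name `rows_before_marker` and its parameters are hypothetical, not part of the original answer. It works on row positions rather than index labels, so it does not rely on the default RangeIndex.

```python
import numpy as np
import pandas as pd


def rows_before_marker(df, column="label", marker=-1, n=2):
    """Keep the rows that fall within n rows before any marker row."""
    # Positional offsets of the marker rows (independent of the index labels).
    pos = np.flatnonzero(df[column].to_numpy() == marker)
    # Positions 1..n rows before each marker, flattened into one array.
    candidates = (pos[:, None] - np.arange(1, n + 1)).ravel()
    # Drop positions before the start of the frame, then deduplicate and sort.
    keep = np.unique(candidates[candidates >= 0])
    return df.iloc[keep]


df = pd.DataFrame({
    "val": [3.2, 2.4, -2.3, -4.9, 3.2, 2.4, -2.3, -4.9, 2.4, -2.3, -4.9],
    "label": [0, 2, 1, -1, 1, 2, -1, -1, 1, 1, -1],
})
result = rows_before_marker(df, n=2)
print(result.index.tolist())  # [1, 2, 4, 5, 6, 8, 9]
```

With n=1 only the row immediately before each -1 is kept, and a frame without any -1 yields an empty result.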
Speed comparison against the for-loop solution:
# create a large df:
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((20000000, 2)), columns=["val", "label"])
df.loc[df.sample(frac=0.01).index, "label"] = -1

def vectorized_filter(df):
    idx = df[df.label == -1].index
    filtered_idx = (idx - 1).union(idx - 2)
    # drop negative positions, which .iloc would otherwise read from the end
    filtered_idx = filtered_idx[filtered_idx >= 0]
    df_new = df.iloc[filtered_idx]
    return df_new

def loop_filter(df):
    # named `marker_idx` rather than `filter` to avoid shadowing the built-in
    marker_idx = df.loc[df['label'] == -1].index
    req_idx = []
    for idx in marker_idx:
        if idx == 0:
            continue
        elif idx == 1:
            req_idx.append(idx - 1)
        else:
            req_idx.append(idx - 2)
            req_idx.append(idx - 1)
    req_idx = list(set(req_idx))
    df2 = df.loc[df.index.isin(req_idx)]
    return df2

start = time.time()
q = vectorized_filter(df)
t1 = time.time() - start

start = time.time()
q2 = loop_filter(df)
t2 = time.time() - start

t2 / t1  # ~20 on my machine
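A timing ratio only means something if both functions return the same rows. The following is a rough sketch, not part of the original answer, that checks this on a smaller frame and times each function over several runs with `timeit`. The frame size, the seed, and the names `small`, `marked`, and `same` are assumptions chosen for illustration, and the ratio will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller frame than in the answer, so the check finishes quickly.
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.random((100_000, 2)), columns=["val", "label"])
marked = rng.choice(len(small), size=len(small) // 100, replace=False)
small.loc[marked, "label"] = -1


def vectorized_filter(df):
    idx = df[df.label == -1].index
    filtered_idx = (idx - 1).union(idx - 2)
    # Negative positions would wrap around under .iloc, so drop them.
    filtered_idx = filtered_idx[filtered_idx >= 0]
    return df.iloc[filtered_idx]


def loop_filter(df):
    req_idx = set()
    for idx in df.loc[df["label"] == -1].index:
        if idx >= 1:
            req_idx.add(idx - 1)
        if idx >= 2:
            req_idx.add(idx - 2)
    return df.loc[df.index.isin(list(req_idx))]


# Both approaches should select exactly the same rows, in the same order.
same = vectorized_filter(small).equals(loop_filter(small))

# Total time over 5 runs each; only the ratio is of interest.
t_vec = timeit.timeit(lambda: vectorized_filter(small), number=5)
t_loop = timeit.timeit(lambda: loop_filter(small), number=5)
print(same, t_loop / t_vec)
```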
Answer 1 (score: 0)
Here is one solution:
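markers = df[df.label.eq(-1)].index
# DataFrame.append was removed in pandas 2.0, so collect the slices and concatenate once;
# max(..., 0) keeps row 0 when a marker sits at position 1
pieces = [df.iloc[max(marker - 2, 0):marker] for marker in markers]
new_df = pd.concat(pieces)
new_df = new_df.reset_index().drop_duplicates().set_index("index")
new_df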
Result:
       val  label
index
1      2.4      2
2     -2.3      1
4      3.2      1
5      2.4      2
6     -2.3     -1
8      2.4      1
9     -2.3      1
Answer 2 (score: 0)
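# named `marker_idx` rather than `filter` to avoid shadowing the built-in
marker_idx = df.loc[df['label'] == -1].index
req_idx = []
for idx in marker_idx:
    if idx == 0:
        # nothing precedes the first row
        continue
    elif idx == 1:
        # only one row precedes the second row
        req_idx.append(idx - 1)
    else:
        req_idx.append(idx - 2)
        req_idx.append(idx - 1)
req_idx = list(set(req_idx))
df2 = df.loc[df.index.isin(req_idx)]
print(df2)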
Output:
   val  label
1  2.4      2
2 -2.3      1
4  3.2      1
5  2.4      2
6 -2.3     -1
8  2.4      1
9 -2.3      1
This should also work if a label in the first two rows is -1.
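The same selection can also be written as a boolean mask instead of collecting index values. This is a different technique from the answers above, sketched here under the assumption that n is small: `Series.shift` with a negative period moves each -1 flag upward by k rows, so a row is kept when any of the next n rows is a marker. The names `is_marker`, `mask`, and `n` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "val": [3.2, 2.4, -2.3, -4.9, 3.2, 2.4, -2.3, -4.9, 2.4, -2.3, -4.9],
    "label": [0, 2, 1, -1, 1, 2, -1, -1, 1, 1, -1],
})

n = 2
is_marker = df["label"].eq(-1)

# Start with nothing selected, then OR in the marker flags shifted up by 1..n rows.
mask = pd.Series(False, index=df.index)
for k in range(1, n + 1):
    mask |= is_marker.shift(-k, fill_value=False)

result = df[mask]
print(result.index.tolist())  # [1, 2, 4, 5, 6, 8, 9]
```

Because the shift is positional, the edge cases at the top of the frame need no special handling, and the mask works with any index, not only the default RangeIndex.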