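I have tried to apply the solution from this question to my real data: Selecting rows in a MultiIndexed dataframe. Somehow I cannot get the result it should give. I have attached the dataframe to select from, as well as the result.
What I need:
Rows 3, 11 and 12 should be returned (when you add up the 4 columns over consecutive rows, row 12 should also be selected. Right now it is not.)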
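import numpy as np
import pandas as pd

df_test = pd.read_csv('df_test.csv')

def find_window(df):
    v = df.values
    # cumulative sums down the rows, with a row of zeros prepended
    s = np.vstack([np.zeros((1, v.shape[1])), v.cumsum(0)])
    threshold = 0
    # every (start, end) pair of row positions with start < end
    r, c = np.triu_indices(s.shape[0], 1)
    d = (c - r)[:, None]   # window length
    e = s[c] - s[r]        # per-column sum over each window
    # windows whose per-column mean is below the threshold in every column
    mask = (e / d < threshold).all(1)
    rng = np.arange(mask.shape[0])
    if mask.any():
        # pick the longest qualifying window
        idx = rng[mask][d[mask].argmax()]
        i0, i1 = r[idx], c[idx]
        return pd.DataFrame(
            v[i0:i1],
            df.loc[df.name].index[i0:i1],
            df.columns
        )

cols = ['2012', '2013', '2014', '2015']
df_test.groupby(level=0)[cols].apply(find_window)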
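The CSV file is here: https://docs.google.com/spreadsheets/d/19oOoBdAs3xRBWq6HReizlqrkWoQR2159nk8GWoR_4-g/edit?usp=sharing
Note: blue boxes = rows that should be returned; yellow boxes are the consecutive column values that are < 0 (the threshold).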
Answer 0 (score: 1)
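Since your solution looks like it should work, I could not figure out how to modify the code from the original question you linked to. However, here is an iterative approach to the problem you are trying to solve.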
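In short, the loop keeps a running total per column and keeps a row only while every running total stays at or below the threshold; a row that fails resets the totals to zero. Here is a minimal sketch of that rule on the first five rows of df_test (values copied from the head() printout further down), so it runs without the CSV. The names `sample` and `running_keep_mask` are hypothetical and only used for illustration.

```python
import numpy as np
import pandas as pd

cols = ["2012", "2013", "2014", "2015"]

# First five rows of the 2012-2015 columns, copied from df_test.head().
sample = pd.DataFrame(
    {
        "2012": [-3.22, -6.74, -0.42, -3.14, -9.40],
        "2013": [-1.60, -1.22, 11.84, -2.48, -11.20],
        "2014": [14.64, 1.58, -5.74, -0.02, 0.68],
        "2015": [-18.20, -0.42, -3.10, -4.78, 12.04],
    }
)


def running_keep_mask(df, cols, threshold=0.0):
    """Hypothetical helper: one True/False per row, using the running-total rule."""
    running = np.zeros(len(cols))
    keep = []
    for row in df[cols].to_numpy(dtype=float):
        candidate = running + row
        # keep the row only if no column's running total exceeds the threshold
        ok = bool((candidate <= threshold).all())
        keep.append(ok)
        # carry the totals forward on success, reset them on failure
        running = candidate if ok else np.zeros(len(cols))
    return keep


mask = running_keep_mask(sample, cols)
print(sample[mask])
```

On these five rows only row 3 survives: rows 0 to 2 each contain a positive value, and row 4 fails because 0.68 added to the -0.02 carried over from row 3 is above 0.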
import pandas as pd

df_test = pd.read_csv('df_test.csv')
print(df_test.head())
"""
   bins_DO    L  T2011    2011  T2012   2012  T2013    2013  T2014   2014  T2015    2015  Ttotal   total
0        0  IR1      6   -6.06     13  -3.22     12   -1.60      7  14.64     12  -18.20      50  -14.44
1        1  IR1     14  -16.32     12  -6.74     14   -1.22      5   1.58      8   -0.42      53  -23.12
2        2  IR1     10   -9.14     10  -0.42     10   11.84     13  -5.74      7   -3.10      50   -6.56
3        3  IR1      9  -13.78     14  -3.14     10   -2.48      6  -0.02      5   -4.78      44  -24.20
4        4  IR1      6    0.54      9  -9.40     15  -11.20      7   0.68      9   12.04      46   -7.34
"""
cols = ['2012', '2013', '2014', '2015']

def process_df(df: pd.DataFrame, cols: list, threshold: float):
    # initialize the benchmark
    # this gets reset any time the newest row fails the threshold test
    base_vals = [0 for _ in cols]
    keep_col = []
    for row in df[cols].values:
        # by default, keep the row
        keep_row = True
        for x in range(len(cols)):
            # if it fails on the row, then make keep row false
            if row[x] + base_vals[x] > threshold:
                keep_row = False
        keep_col.append(keep_row)
        if keep_row:
            # if we were happy with those results, then keep adding the column values to the base_vals
            for x in range(len(cols)):
                base_vals[x] += row[x]
        else:
            # otherwise, reset the base vals
            base_vals = [0 for _ in cols]
    # only keep rows that we want
    df = df.loc[keep_col, :]
    return df

new_df = process_df(df=df_test, cols=cols, threshold=0)
print(new_df)
"""
    bins_DO    L  T2011    2011  T2012    2012  T2013    2013  T2014    2014  T2015   2015  Ttotal   total
3         3  IR1      9  -13.78     14   -3.14     10   -2.48      6   -0.02      5  -4.78      44  -24.20
11       11  IR1      7    7.10     10  -10.04      7  -10.60     17   -5.56     11  -8.44      52  -27.54
12       12  IR1     10   -0.28      7   -7.30      8    5.96      8  -12.58     10  -6.86      43  -21.06
"""
Answer 1 (score: 0)
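Based on the logic in your comments, you are looking for rows where either every value in columns 2012, 2013, 2014 and 2015 is less than 0, or the cumulative sum across those columns is less than 0. Since the second condition is always true whenever the first one is, you only need to test the second condition.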
cols = ['2012', '2013', '2014', '2015']
# df is the dataframe loaded from df_test.csv
df.loc[(df[cols].cumsum(axis=1) < 0).all(axis=1), cols]

      2012    2013    2014    2015
1    -6.74   -1.22    1.58   -0.42
3    -3.14   -2.48   -0.02   -4.78
4    -9.40  -11.20    0.68   12.04
7    -3.12   -5.74    0.84    1.94
8   -10.14  -12.24  -11.10   15.20
11  -10.04  -10.60   -5.56   -8.44
12   -7.30    5.96  -12.58   -6.86
15  -10.24   -4.16    5.46  -14.00
If this is not what you are looking for, please let me know in the comments.
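To see why testing only the cumulative condition is enough, here is a small sketch that evaluates both conditions on the first five rows of the data (values copied from the df_test.head() output in the other answer). The names `sample`, `all_negative` and `cumulative_negative` are hypothetical. Note that `cumsum(axis=1)` accumulates left to right across the year columns within a single row, not down consecutive rows.

```python
import pandas as pd

cols = ["2012", "2013", "2014", "2015"]

# First five rows of the 2012-2015 columns, copied from df_test.head().
sample = pd.DataFrame(
    {
        "2012": [-3.22, -6.74, -0.42, -3.14, -9.40],
        "2013": [-1.60, -1.22, 11.84, -2.48, -11.20],
        "2014": [14.64, 1.58, -5.74, -0.02, 0.68],
        "2015": [-18.20, -0.42, -3.10, -4.78, 12.04],
    }
)

# Condition 1: every value in the row is below 0.
all_negative = (sample[cols] < 0).all(axis=1)

# Condition 2: the left-to-right running total is below 0 at every column.
cumulative_negative = (sample[cols].cumsum(axis=1) < 0).all(axis=1)

print(sample.index[all_negative].tolist())         # [3]
print(sample.index[cumulative_negative].tolist())  # [1, 3, 4]
```

Every row that passes condition 1 also passes condition 2, while rows 1 and 4 pass only the cumulative test, which matches the first entries of the output above.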