Find the index of a value before the maximum of each column in a Python DataFrame

Time: 2019-06-13 16:32:40

Tags: python pandas dataframe

I have a DataFrame like the one below.

test = pd.DataFrame({'col1':[0,0,1,0,0,0,1,2,0], 'col2': [0,0,1,2,3,0,0,0,0]})
   col1  col2
0     0     0
1     0     0
2     1     1
3     0     2
4     0     3
5     0     0
6     1     0
7     2     0
8     0     0

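For each column, I want to find the index of the last 1 that occurs before that column's maximum. For example, in the first column the maximum is 2, and the index of the 1 before the 2 is 6. In the second column the maximum is 3, and the index of the 1 before the 3 is 2.

In short, I would like to get [6, 2] as the output for this test DataFrame. Is there a quick way to achieve this?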

6 answers:

Answer 0: (score: 5)

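Use mask to hide the elements that are not 1 (together with everything from the column maximum onwards), then apply Series.last_valid_index to each column.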

m = test.eq(test.max()).cumsum().gt(0) | test.ne(1) 
test.mask(m).apply(pd.Series.last_valid_index)

col1    6
col2    2
dtype: int64
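To make the mask easier to follow, here is the same idea split into named steps. This is only a sketch of the one-liner above; the intermediate names (at_or_after_max, not_one, only_ones_before_max) are illustrative and not part of the original answer.

```python
import pandas as pd

test = pd.DataFrame({'col1': [0, 0, 1, 0, 0, 0, 1, 2, 0],
                     'col2': [0, 0, 1, 2, 3, 0, 0, 0, 0]})

# True from the first occurrence of each column's maximum onwards
at_or_after_max = test.eq(test.max()).cumsum().gt(0)

# True wherever the value is not 1
not_one = test.ne(1)

# Replace everything except the 1s that precede the maximum with NaN
only_ones_before_max = test.mask(at_or_after_max | not_one)

# The last non-NaN label in each column is the index we are after
result = only_ones_before_max.apply(pd.Series.last_valid_index)
print(result)  # col1 -> 6, col2 -> 2
```

A column with no 1 before its maximum would end up entirely NaN, in which case last_valid_index returns None for that column.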

To vectorise this with numpy, you can use numpy.cumsum and argmax:

idx = ((test.eq(1) & test.eq(test.max()).cumsum().eq(0))
            .values
            .cumsum(axis=0)
            .argmax(axis=0))
idx
# array([6, 2])

pd.Series(idx, index=[*test])

col1    6
col2    2
dtype: int64

Answer 1: (score: 4)

Using @cs95's idea of last_valid_index:

test.apply(lambda x: x[:x.idxmax()].eq(1)[lambda i:i].last_valid_index())

Output:

col1    6
col2    2
dtype: int64

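Explanation:

Slice each column up to (but not including) its maximum, then test for values equal to 1 and take the index of the last True value.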
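The steps can be traced on a single column. This is a sketch that assumes the default RangeIndex, so the label returned by idxmax is also a position; it uses iloc to make the positional slice explicit, and the variable names are illustrative.

```python
import pandas as pd

test = pd.DataFrame({'col1': [0, 0, 1, 0, 0, 0, 1, 2, 0],
                     'col2': [0, 0, 1, 2, 3, 0, 0, 0, 0]})

col = test['col1']

# Label of the first maximum; equal to its position with a RangeIndex
max_pos = col.idxmax()

# Everything strictly before the maximum
before_max = col.iloc[:max_pos]

# Boolean Series marking the 1s in that slice
is_one = before_max.eq(1)

# Keep only the True entries and take the last label
last_one = is_one[is_one].last_valid_index()

# Alternative: the running count of 1s first reaches its maximum at the last 1
via_cumsum = is_one.cumsum().idxmax()

print(last_one, via_cumsum)  # 6 6
```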

Or, as @QuangHoang suggested:

test.apply(lambda x: x[:x.idxmax()].eq(1).cumsum().idxmax()) 

Answer 2: (score: 4)

Numpy

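t = test.to_numpy()
a = t.argmax(0)          # row position of the first maximum in each column

i, j = np.where(t == 1)  # row (i) and column (j) positions of every 1
mask = i <= a[j]         # keep only the 1s at or before their column's maximum
i = i[mask]
j = j[mask]

b = np.empty_like(a)
b.fill(-1)               # -1 marks a column with no qualifying 1

np.maximum.at(b, j, i)   # per column, keep the largest qualifying row position

pd.Series(b, test.columns)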

col1    6
col2    2
dtype: int64

apply

test.apply(lambda s: max(s.index, key=lambda x: (s[x] == 1, s[x] <= s.max(), x)))

col1    6
col2    2
dtype: int64

cummax

test.eq(1).where(test.cummax().lt(test.max())).iloc[::-1].idxmax()

col1    6
col2    2
dtype: int64

Timing

I just wanted to try out a new tool and do some benchmarking; see this post.

Results

r.to_pandas_dataframe().T

         10        31        100       316       1000      3162      10000
al_0  0.003696  0.003718  0.005512  0.006210  0.010973  0.007764  0.012008
wb_0  0.003348  0.003334  0.003913  0.003935  0.004583  0.004757  0.006096
qh_0  0.002279  0.002265  0.002571  0.002643  0.002927  0.003070  0.003987
sb_0  0.002235  0.002246  0.003072  0.003357  0.004136  0.004083  0.005286
sb_1  0.001771  0.001779  0.002331  0.002353  0.002914  0.002936  0.003619
cs_0  0.005742  0.005751  0.006748  0.006808  0.007845  0.008088  0.009898
cs_1  0.004034  0.004045  0.004871  0.004898  0.005769  0.005997  0.007338
pr_0  0.002484  0.006142  0.027101  0.085944  0.374629  1.292556  6.220875
pr_1  0.003388  0.003414  0.003981  0.004027  0.004658  0.004929  0.006390
pr_2  0.000087  0.000088  0.000089  0.000093  0.000107  0.000145  0.000300

fig = plt.figure(figsize=(10, 10))
ax = plt.subplot()
r.plot(ax=ax)


Setup

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()

def al_0(test): return test.apply(lambda x: x.where(x[:x.idxmax()].eq(1)).drop_duplicates(keep='last').idxmin())
def wb_0(df): return (df.iloc[::-1].cummax().eq(df.max())&df.eq(1).iloc[::-1]).idxmax()
def qh_0(test): return (test.eq(1) & (test.index.values[:,None] < test.idxmax().values)).cumsum().idxmax()
def sb_0(test): return test.apply(lambda x: x[:x.idxmax()].eq(1)[lambda i:i].last_valid_index())
def sb_1(test): return test.apply(lambda x: x[:x.idxmax()].eq(1).cumsum().idxmax())
def cs_0(test): return (lambda m: test.mask(m).apply(pd.Series.last_valid_index))(test.eq(test.max()).cumsum().gt(0) | test.ne(1))
def cs_1(test): return pd.Series((test.eq(1) & test.eq(test.max()).cumsum().eq(0)).values.cumsum(axis=0).argmax(axis=0), test.columns)
def pr_0(test): return test.apply(lambda s: max(s.index, key=lambda x: (s[x] == 1, s[x] <= s.max(), x)))
def pr_1(test): return test.eq(1).where(test.cummax().lt(test.max())).iloc[::-1].idxmax()
def pr_2(test):
    t = test.to_numpy()
    a = t.argmax(0)

    i, j = np.where(t == 1)
    mask = i <= a[j]
    i = i[mask]
    j = j[mask]

    b = np.empty_like(a)
    b.fill(-1)

    np.maximum.at(b, j, i)

    return pd.Series(b, test.columns)

import math

def gen_test(n):
    a = np.random.randint(100, size=(n, int(math.log10(n)) + 1))
    idx = a.argmax(0)
    while (idx == 0).any():
        a = np.random.randint(100, size=(n, int(math.log10(n)) + 1))
        idx = a.argmax(0)        

    for j, i in enumerate(idx):
        a[np.random.randint(i), j] = 1

    return pd.DataFrame(a).add_prefix('col')

@b.add_arguments('DataFrame Size')
def argument_provider():
    for exponent in np.linspace(1, 4, 7):  # sizes 10, 31, ..., 10000, as in the results table
        size = int(10 ** exponent)
        yield size, gen_test(size)

b.add_functions([al_0, wb_0, qh_0, sb_0, sb_1, cs_0, cs_1, pr_0, pr_1, pr_2])

r = b.run()

Answer 3: (score: 3)

A bit of logic here (df is the test DataFrame from the question):

(df.iloc[::-1].cummax().eq(df.max())&df.eq(1).iloc[::-1]).idxmax()
Out[187]: 
col1    6
col2    2
dtype: int64

Answer 4: (score: 2)

Here is a solution that mixes numpy and pandas:

(test.eq(1) & (test.index.values[:,None] < test.idxmax().values)).cumsum().idxmax()

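This is faster than most of the other solutions: in the timings under Answer 2 (row qh_0), only sb_1 and pr_2 are faster at every size.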
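The expression depends on NumPy broadcasting: a column vector of row labels is compared against a row vector holding each column's idxmax, which produces a boolean array of the same shape as the DataFrame. The sketch below spells this out; the intermediate names are illustrative, and it assumes a numeric, sorted index such as the default RangeIndex.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'col1': [0, 0, 1, 0, 0, 0, 1, 2, 0],
                     'col2': [0, 0, 1, 2, 3, 0, 0, 0, 0]})

# Row labels as a column vector, shape (9, 1)
row_labels = test.index.values[:, None]

# Label of the maximum in each column, shape (2,)
max_labels = test.idxmax().values

# Broadcasts to shape (9, 2): True for rows strictly before the column maximum
before_max = row_labels < max_labels

# 1s that occur before the maximum
candidates = test.eq(1) & before_max

# The running count first reaches its final value at the last candidate
result = candidates.cumsum().idxmax()
print(result)  # col1 -> 6, col2 -> 2
```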

Answer 5: (score: 1)

I would use dropna together with where to drop the duplicated 1s, keep the last 1, and call idxmin on it.
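No code accompanies this answer here. The al_0 function in the benchmark setup under Answer 2 combines where, drop_duplicates and idxmin in the way described, so the sketch below assumes that is the intended approach. The helper name last_one_before_max is purely illustrative, and the code assumes a default RangeIndex and at least one 1 before each column's maximum.

```python
import pandas as pd

test = pd.DataFrame({'col1': [0, 0, 1, 0, 0, 0, 1, 2, 0],
                     'col2': [0, 0, 1, 2, 3, 0, 0, 0, 0]})

def last_one_before_max(col):
    # Rows strictly before the first maximum (position-based slice)
    before_max = col.iloc[:col.to_numpy().argmax()]
    # Keep the 1s and turn every other value into NaN
    ones = before_max.where(before_max.eq(1))
    # Of the repeated values, keep only the last occurrence of each
    deduped = ones.drop_duplicates(keep='last')
    # idxmin skips NaN, so it returns the label of the remaining 1
    return deduped.idxmin()

result = test.apply(last_one_before_max)
print(result)  # col1 -> 6, col2 -> 2
```

If a column has no 1 before its maximum, every value becomes NaN and idxmin has nothing to return, so that case would need separate handling.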