Set a column value based on a lookup from other columns

Time: 2019-10-19 16:27:49

Tags: python pandas

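I have a DataFrame containing dates and categories, plus a column indicating when a one-off event occurred for a category. I want to create a new column holding the time remaining until the event occurs, or an indicator that no event occurs, such as a negative time.

The dataset is large, and I suspect there is a better way to do this in pandas than brute-forcing it with loops!

So, in short, if I create a dataset like this: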

import pandas as pd

#create example dataset
data = {'categories':['a','b','c']*4,'dates':[i for i in range(4) for j in range(3)],'event':[0]*3*4}

#add a couple of events
data['event'][4] = 1
data['event'][9] = 1

df = pd.DataFrame(data)


How can I best get output like the following?

   categories  dates  event  time_until
0           a      0      0           3
1           b      0      0           1
2           c      0      0          -1
3           a      1      0           2
4           b      1      1           0
5           c      1      0          -1
6           a      2      0           1
7           b      2      0          -1
8           c      2      0          -1
9           a      3      1           0
10          b      3      0          -1
11          c      3      0          -1

Thanks for your help!

3 answers:

Answer 0: (score: 1)

Use groupby

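def f(s):
    # Renumber the rows of this group as 0..n-1
    s = s.reset_index(drop=True)
    # Rows where the event occurred
    one = s[s.eq(1)]
    # No event in this category: flag every row with -1
    if one.empty: return -1
    # Rows remaining until the first event (negative after it)
    return -s.index + one.index[0]

df['time_until'] = df.groupby('categories').event.transform(f)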

   categories  dates  event  time_until
0           a      0      0           3
1           b      0      0           1
2           c      0      0          -1
3           a      1      0           2
4           b      1      1           0
5           c      1      0          -1
6           a      2      0           1
7           b      2      0          -1
8           c      2      0          -1
9           a      3      1           0
10          b      3      0          -2
11          c      3      0          -1

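Note that this also computes the distance for rows after the event has occurred. So for the following events, you get the following output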

event = [0, 0, 0, 1, 0, 0]
until = [3, 2, 1, 0, -1, -2]
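To see the note in action, here is a minimal sketch that applies the answer's `f` to that one series outside of any groupby. It assumes only pandas; the variable names `event` and `until` mirror the note.

```python
import pandas as pd

def f(s):
    # Same function as in the answer, reformatted
    s = s.reset_index(drop=True)
    one = s[s.eq(1)]
    if one.empty:
        return -1
    return -s.index + one.index[0]

event = pd.Series([0, 0, 0, 1, 0, 0])
until = list(f(event))
print(until)  # [3, 2, 1, 0, -1, -2]
```

The values keep decreasing after the event because the function simply subtracts each row position from the position of the first event.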

If you need every negative value to be -1, just adjust the column at the end

df['time_until'] = df.time_until.where(df.time_until >= -1, -1)
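The function above counts rows within each group, so it implicitly assumes one row per date, in date order. If that does not hold, a possible alternative is to look up each category's event date and subtract the `dates` column directly. This is a sketch of a different technique (a `map` lookup rather than `transform`), not the answer's method; `event_date` and `gap` are illustrative names, and it assumes the first event per category is the one of interest.

```python
import pandas as pd

# Example dataset from the question
data = {'categories': ['a', 'b', 'c'] * 4,
        'dates': [i for i in range(4) for j in range(3)],
        'event': [0] * 12}
data['event'][4] = 1
data['event'][9] = 1
df = pd.DataFrame(data)

# Date of the first event per category; categories with no event are absent
event_date = df.loc[df['event'].eq(1)].groupby('categories')['dates'].min()

# Look up each row's event date; rows of event-free categories get NaN
gap = df['categories'].map(event_date) - df['dates']

# Rows after the event (negative gap) and rows with no event (NaN) both become -1
df['time_until'] = gap.where(gap >= 0, -1).astype(int)
print(df['time_until'].tolist())
# [3, 1, -1, 2, 0, -1, 1, -1, -1, 0, -1, -1]
```

Because the comparison `gap >= 0` is false for NaN, a single `where` call handles both the "event already passed" and the "no event" cases.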

Answer 1: (score: 0)

Alternative solution:

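# Within each category, order rows from the latest date to the earliest
df.sort_values(by=['categories', 'dates'], ascending=[True, False], inplace=True)
# tmp is 1 on the event row and on every earlier date, 0 on later dates
df['tmp'] = df.groupby('categories')['event'].transform('cumsum')
# Summing tmp counts rows back from the event; subtract 1 so the event row is 0
df['time_until'] = df.groupby('categories')['tmp'].transform('cumsum') - 1
df.drop(columns='tmp', inplace=True)
# Restore the original row order
df.sort_values(by=['dates', 'categories'], ascending=[True, True], inplace=True)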

Output:

   categories  dates  event  time_until
0           a      0      0           3
1           b      0      0           1
2           c      0      0          -1
3           a      1      0           2
4           b      1      1           0
5           c      1      0          -1
6           a      2      0           1
7           b      2      0          -1
8           c      2      0          -1
9           a      3      1           0
10          b      3      0          -1
11          c      3      0          -1
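To make the double cumulative sum easier to follow, here is a small sketch that traces it for category 'b' alone, with the rows already sorted from latest to earliest date. The frame `b` is an illustrative extract from the question's data, not part of the answer.

```python
import pandas as pd

# Category 'b' from the question, sorted by date descending
b = pd.DataFrame({'dates': [3, 2, 1, 0], 'event': [0, 0, 1, 0]})

# First pass: 0 on dates after the event, 1 from the event row back to the earliest date
b['tmp'] = b['event'].cumsum()

# Second pass: counts rows back from the event; subtracting 1 puts the event row at 0
b['time_until'] = b['tmp'].cumsum() - 1

print(b['tmp'].tolist())         # [0, 0, 1, 1]
print(b['time_until'].tolist())  # [-1, -1, 0, 1]
```

This relies on there being at most one event per category, as the question describes. With a second event, `tmp` would reach 2 and the second sum would no longer count rows.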

Answer 2: (score: -1)

Try something like this:

import pandas as pd
import numpy as np

data = {'categories':['a','b','c']*4,
        'dates':[i for i in range(4) for j in range(3)],
        'event':[0, 1, 0]*4}

df = pd.DataFrame(data)
print(df)

# One way
df.loc[df.event == 0, 'Newevents'] = 'Cancelled'
df.loc[df.event != 0, 'Newevents'] = 'Scheduled'

# Another way
conditions = [
    (df['categories'] == "a"),
    (df['categories'] == "b"),
    (df['categories'] == "c")]
choices = ['None', 'Completed', 'Scheduled']
df['NewCategories'] = np.select(conditions, choices, default='black')
print(df)

Output:

   categories  dates  event
0           a      0      0
1           b      0      1
2           c      0      0
3           a      1      0
4           b      1      1
5           c      1      0
6           a      2      0
7           b      2      1
8           c      2      0
9           a      3      0
10          b      3      1
11          c      3      0
   categories  dates  event  Newevents NewCategories
0           a      0      0  Cancelled          None
1           b      0      1  Scheduled     Completed
2           c      0      0  Cancelled     Scheduled
3           a      1      0  Cancelled          None
4           b      1      1  Scheduled     Completed
5           c      1      0  Cancelled     Scheduled
6           a      2      0  Cancelled          None
7           b      2      1  Scheduled     Completed
8           c      2      0  Cancelled     Scheduled
9           a      3      0  Cancelled          None
10          b      3      1  Scheduled     Completed
11          c      3      0  Cancelled