Suppose I have a pandas DataFrame in Python that shows the names of the business unit leaders of different units over time. It might look like this:
Leader_Jan Leader_Feb Leader_Mar Leader_Apr
Unit1 Nina Nina Nina Nina
Unit2 Lena Lena NaN Lena
Unit3 Alex Maria Alex Alex
Unit4 Emilia NaN NaN NaN
Unit5 NaN Corinna Petra NaN
which can be recreated as follows:
import pandas as pd
import numpy as np
a = ['Nina','Nina','Nina','Nina']
b = ['Lena','Lena',np.nan,'Lena']
c = ['Alex','Maria','Alex','Alex']
d = ['Emilia',np.nan,np.nan,np.nan]
e = [np.nan,'Corinna','Petra',np.nan]
# np.nan is used instead of np.NaN, since the np.NaN alias was removed in NumPy 2.0
data = pd.DataFrame(data=[a,b,c,d,e],
                    columns=['Leader_Jan','Leader_Feb','Leader_Mar','Leader_Apr'],
                    index=['Unit1','Unit2','Unit3','Unit4','Unit5'])
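Context: I want to find out in which units leaders stayed for a very short or a very long time (in months), so that I can later work out whether there are team conflicts in particular departments of my company.
I want to add to the DataFrame the minimum and maximum time (in months) that a leader was present during one uninterrupted period. Because of possible interruptions (see Unit2 and Unit3), I cannot simply use value_counts on the distinct names in each row. Instead, I need to find the lengths of the sequences of leader names that are separated by NaN values or by other names. To see what I consider a sequence, look at the different colors in this picture:
As you can see, an interruption like the ones in Unit2 and Unit3 results in multiple stays. NaN months within a sequence should not be counted.
The result should look like this: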
Leader_Jan Leader_Feb Leader_Mar Leader_Apr Min_length_of_stay_leaders \
Unit1 Nina Nina Nina Nina 4
Unit2 Lena Lena NaN Lena 1
Unit3 Alex Maria Alex Alex 1
Unit4 Emilia NaN NaN NaN 1
Unit5 NaN Corinna Petra NaN 1
Max_length_of_stay_leaders
Unit1 4
Unit2 2
Unit3 2
Unit4 1
Unit5 1
I know this may be complicated, but I would appreciate any help or hints, because I am a bit lost here.
Answer 0: (score: 2)
This is actually quite easy with itertools.groupby:
from itertools import groupby

def min_max_durations(row):
    # the group object consumes the iterator, but we don't care about the values
    # so we just sum "1" to get the length.
    # Taken from https://stackoverflow.com/questions/44490079/how-to-turn-an-itertools-grouper-object-into-a-list
    durations = [sum(1 for _ in group) for key, group in groupby(row) if not isinstance(key, float)]
    return min(durations), max(durations)

data["min_lengths_of_stay"], data["max_lengths_of_stay"] = zip(*data.apply(min_max_durations, axis=1))
The isinstance check against float is just a quick way to drop the NaN values from the count here; you can make it as elaborate as you like.
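If a row can contain no leader at all, min() and max() in the function above would raise a ValueError on an empty list. A minimal sketch of a more defensive variant follows. The name min_max_durations_safe, the demo frame with Unit6, and the choice to return NaN for such rows are assumptions made for illustration, not part of the original answer. It also uses pd.isna instead of the float check, so None is treated as missing too.

```python
from itertools import groupby

import numpy as np
import pandas as pd


def min_max_durations_safe(row):
    """Return (shortest, longest) uninterrupted stay in a row of leader names."""
    # groupby yields one (key, group) pair per run of equal adjacent values;
    # runs whose key is missing (NaN or None) are skipped.
    durations = [
        sum(1 for _ in group)
        for key, group in groupby(row)
        if not pd.isna(key)
    ]
    if not durations:  # assumption: a row without any leader yields NaN
        return (np.nan, np.nan)
    return (min(durations), max(durations))


# Hypothetical demo data: Unit6 never has a leader.
demo = pd.DataFrame(
    [["Nina", "Nina", "Nina", "Nina"],
     ["Lena", "Lena", np.nan, "Lena"],
     [np.nan, np.nan, np.nan, np.nan]],
    columns=["Leader_Jan", "Leader_Feb", "Leader_Mar", "Leader_Apr"],
    index=["Unit1", "Unit2", "Unit6"],
)
stays = pd.DataFrame(
    [min_max_durations_safe(row) for _, row in demo.iterrows()],
    index=demo.index,
    columns=["min_lengths_of_stay", "max_lengths_of_stay"],
)
print(demo.join(stays))
```

Building the two columns from a list of tuples avoids the zip(*...) unpacking, which is a matter of taste; either should work here.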
This outputs the correct result (note that, unlike your example, a copy of your reproduction code has 3 "Alex" entries in Unit3):
Leader_Jan Leader_Feb Leader_Mar Leader_Apr min_lengths_of_stay \
Unit1 Nina Nina Nina Nina 4
Unit2 Lena Lena NaN Lena 1
Unit3 Maria Alex Alex Alex 1
Unit4 Emilia NaN NaN NaN 1
Unit5 NaN Corinna Petra NaN 1
max_lengths_of_stay
Unit1 4
Unit2 2
Unit3 3
Unit4 1
Unit5 1
Answer 1: (score: 1)
This should get you started (df here is the DataFrame called data in the question) -
temp = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount()+1, axis=1)
mins = temp.min(1)
maxs = temp.max(1)
mask = temp.apply(lambda x: x.is_monotonic_increasing and x.is_unique, axis=1)
mins.loc[mask] = maxs.loc[mask]
mins.name='Min_length_of_stay_leaders'
maxs.name='Max_length_of_stay_leaders'
df.join(mins).join(maxs)
Output
Leader_Jan Leader_Feb Leader_Mar Leader_Apr Min_length_of_stay_leaders \
Unit1 Nina Nina Nina Nina 4
Unit2 Lena Lena NaN Lena 1
Unit3 Alex Maria Alex Alex 1
Unit4 Emilia NaN NaN NaN 1
Unit5 NaN Corinna Petra NaN 1
Max_length_of_stay_leaders
Unit1 4
Unit2 2
Unit3 2
Unit4 1
Unit5 1
Explanation
temp = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount()+1, axis=1)
This gives you, for each row, a running count of consecutive months with the same leader name -
Leader_Jan Leader_Feb Leader_Mar Leader_Apr
Unit1 1 2 3 4
Unit2 1 2 1 1
Unit3 1 1 1 2
Unit4 1 1 1 1
Unit5 1 1 1 1
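To make the one-liner easier to follow, here is a small sketch that splits it into its steps for a single row (the Unit2 values from the question). The variable names are made up for illustration only.

```python
import numpy as np
import pandas as pd

row = pd.Series(
    ["Lena", "Lena", np.nan, "Lena"],
    index=["Leader_Jan", "Leader_Feb", "Leader_Mar", "Leader_Apr"],
)

# True wherever a month differs from the previous one. A comparison with NaN
# is always "not equal", so the first month, every NaN month and every month
# right after a NaN start a new run.
starts_new_run = row != row.shift()

# The cumulative sum of the flags gives each run its own id.
run_id = starts_new_run.cumsum()

# Position of each month inside its run, starting at 1.
position_in_run = row.groupby(run_id).cumcount() + 1

print(run_id.tolist())           # [1, 1, 2, 3]
print(position_in_run.tolist())  # [1, 2, 1, 1]
```

Note that the NaN month gets its own run with position 1, which is why NaN entries show up as 1 in the table above.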
Then simply take the max and min -
mins = temp.min(1)
maxs = temp.max(1)
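The problem is then Nina: she held the post the whole time, so in this case min must also be 4.
So, just for this edge case, the mask object detects strictly monotonically increasing series and substitutes max for min in that case.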
I'm still not sure whether it works in all cases, so please check.
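Following up on that caveat: as far as I can tell, temp.min(1) is always 1 and the mask only repairs rows with a single uninterrupted stay, so a row such as A, A, B, B would report a minimum of 1 although both stays last 2 months. A NaN month is also counted as a stay of length 1. One possible fix, sketched below under the same run-id idea, is to count the months per run only where a name is present. The helper name stay_lengths, the extra Unit6 row, and returning NaN for a row without any leader are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def stay_lengths(row):
    """Min and max number of consecutive months one leader stays in a row."""
    run_id = (row != row.shift()).cumsum()
    # Keep only months that have a leader, then count the months per run.
    lengths = run_id[row.notna()].value_counts()
    if lengths.empty:  # assumption: no leader at all yields NaN
        return pd.Series({"Min_length_of_stay_leaders": np.nan,
                          "Max_length_of_stay_leaders": np.nan})
    return pd.Series({"Min_length_of_stay_leaders": lengths.min(),
                      "Max_length_of_stay_leaders": lengths.max()})


months = ["Leader_Jan", "Leader_Feb", "Leader_Mar", "Leader_Apr"]
data = pd.DataFrame(
    [["Nina", "Nina", "Nina", "Nina"],
     ["Lena", "Lena", np.nan, "Lena"],
     ["Alex", "Maria", "Alex", "Alex"],
     ["Emilia", np.nan, np.nan, np.nan],
     [np.nan, "Corinna", "Petra", np.nan],
     ["Ada", "Ada", "Bea", "Bea"]],  # hypothetical extra row, not in the question
    columns=months,
    index=["Unit1", "Unit2", "Unit3", "Unit4", "Unit5", "Unit6"],
)
result = data.join(data[months].apply(stay_lengths, axis=1))
print(result)
```

For the five units from the question this should reproduce the expected minimum and maximum values, and the extra Unit6 row gets 2 for both.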