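Pandas DataFrame: number of consecutive identical values across columns in each row

Asked: 2018-11-13 11:25:14

Tags: python pandas dataframe

Suppose I have a pandas DataFrame in Python that shows the names of the business unit leaders of different units over time. It might look like this: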

          Leader_Jan Leader_Feb Leader_Mar Leader_Apr
Unit1       Nina       Nina       Nina       Nina
Unit2       Lena       Lena        NaN       Lena
Unit3       Alex      Maria       Alex       Alex
Unit4     Emilia        NaN        NaN        NaN
Unit5        NaN    Corinna      Petra        NaN

which can be recreated as follows:

import pandas as pd
import numpy as np
# np.nan marks months without a leader (the np.NaN alias was removed in NumPy 2.0)
a = ['Nina','Nina','Nina','Nina']
b = ['Lena','Lena',np.nan,'Lena']
c = ['Alex','Maria','Alex','Alex']
d = ['Emilia',np.nan,np.nan,np.nan]
e = [np.nan,'Corinna','Petra',np.nan]
data = pd.DataFrame(data=[a,b,c,d,e],
                    columns=['Leader_Jan','Leader_Feb','Leader_Mar','Leader_Apr'],
                    index=['Unit1','Unit2','Unit3','Unit4','Unit5'])

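Context: I want to find out in which units leaders stay for a very short or a very long time (in months), so that I can later find out whether there are team conflicts in specific units of my company.

I want to add to the DataFrame the minimum and maximum time (in months) that a leader was present in one uninterrupted period. Because of possible interruptions (see Unit2 and Unit3), I cannot simply use value_counts on the distinct names in each row. Rather, I need to find the lengths of the sequences of leader names that are separated by NaN values and by other names. To see what I regard as a sequence, check the different colors in this picture: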

[Image: sequences_colored]

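As you can see, interruptions such as those in Unit2 and Unit3 result in multiple stays. Months that are NaN should not be counted as part of any sequence.

The result should look like this: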
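To illustrate why a plain value_counts is not enough, here is a minimal sketch using the Unit3 row from the example (the variable names unit3 and counts are illustrative, not part of the question's code):

```python
import pandas as pd

# Unit3 from the example: Alex's stay is interrupted by Maria in February.
unit3 = pd.Series(
    ["Alex", "Maria", "Alex", "Alex"],
    index=["Leader_Jan", "Leader_Feb", "Leader_Mar", "Leader_Apr"],
)

# value_counts ignores the order of the months, so interrupted stays are merged.
counts = unit3.value_counts()
print(counts["Alex"])   # 3
print(counts["Maria"])  # 1
```

Alex appears three times in total, but the longest uninterrupted stay is only two months (March and April), which is the number the question asks for.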

      Leader_Jan Leader_Feb Leader_Mar Leader_Apr  Min_length_of_stay_leaders  \
Unit1       Nina       Nina       Nina       Nina                           4   
Unit2       Lena       Lena        NaN       Lena                           1   
Unit3       Alex      Maria       Alex       Alex                           1   
Unit4     Emilia        NaN        NaN        NaN                           1   
Unit5        NaN    Corinna      Petra        NaN                           1   

       Max_length_of_stay_leaders  
Unit1                           4  
Unit2                           2  
Unit3                           2  
Unit4                           1  
Unit5                           1 

I know this might be complicated, but I would appreciate any help or hints, as I am a bit lost here.

2 Answers:

Answer 0 (score: 2)

This is actually quite easy with itertools.groupby:

from itertools import groupby

def min_max_durations(row):
    # the group object consumes the iterator, but we don't care about the values 
    # so we just sum "1" to get the length.
    # Taken from https://stackoverflow.com/questions/44490079/how-to-turn-an-itertools-grouper-object-into-a-list
    durations = [sum(1 for _ in group) for key, group in groupby(row) if not isinstance(key, float)]
    return min(durations), max(durations)

data["min_lengths_of_stay"], data["max_lengths_of_stay"] = zip(*data.apply(min_max_durations, axis=1))

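The isinstance check for float is just a quick way to drop the NaN values from the count here; you can make it as sophisticated as you like.

This outputs the correct result (note that, unlike your example, pasting your reproduction code gives Unit3 three "Alex" entries):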
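As one possible way of making that check more explicit, the sketch below separates the "is this month empty?" test from the run-length counting. The helpers is_missing, run_lengths and min_max_durations_safe are hypothetical names introduced here for illustration; returning 0 for a row without any leader is an assumption, not something the question specifies.

```python
from itertools import groupby
import math

def is_missing(value):
    # Assumption: a month without a leader is stored as None or float NaN.
    return value is None or (isinstance(value, float) and math.isnan(value))

def run_lengths(row):
    """Return (name, length) for each uninterrupted run of a non-missing name."""
    return [
        (key, sum(1 for _ in group))
        for key, group in groupby(row)
        if not is_missing(key)
    ]

def min_max_durations_safe(row):
    lengths = [length for _, length in run_lengths(row)]
    # default=0 avoids a ValueError when a row contains no leader at all.
    return min(lengths, default=0), max(lengths, default=0)

nan = float("nan")
print(run_lengths(["Lena", "Lena", nan, "Lena"]))            # [('Lena', 2), ('Lena', 1)]
print(min_max_durations_safe(["Alex", "Maria", "Alex", "Alex"]))  # (1, 2)
print(min_max_durations_safe([nan, nan]))                     # (0, 0)
```

Without a default, min and max raise a ValueError on an empty list, which is what would happen in the original function for a unit that never had a leader.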

      Leader_Jan Leader_Feb Leader_Mar Leader_Apr  min_lengths_of_stay  \
Unit1       Nina       Nina       Nina       Nina                    4   
Unit2       Lena       Lena        NaN       Lena                    1   
Unit3      Maria       Alex       Alex       Alex                    1   
Unit4     Emilia        NaN        NaN        NaN                    1   
Unit5        NaN    Corinna      Petra        NaN                    1   
       max_lengths_of_stay  
Unit1                    4  
Unit2                    2  
Unit3                    3  
Unit4                    1  
Unit5                    1  

Answer 1 (score: 1)

This should get you started (df is the DataFrame from the question, where it is called data):

temp = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount()+1, axis=1)

mins = temp.min(1)
maxs = temp.max(1)
mask = temp.apply(lambda x: x.is_monotonic_increasing and x.is_unique, axis=1)
mins.loc[mask] = maxs.loc[mask]
mins.name='Min_length_of_stay_leaders'
maxs.name='Max_length_of_stay_leaders'

df.join(mins).join(maxs)

Output

      Leader_Jan Leader_Feb Leader_Mar Leader_Apr  Min_length_of_stay_leaders  \
Unit1       Nina       Nina       Nina       Nina                           4   
Unit2       Lena       Lena        NaN       Lena                           1   
Unit3       Alex      Maria       Alex       Alex                           1   
Unit4     Emilia        NaN        NaN        NaN                           1   
Unit5        NaN    Corinna      Petra        NaN                           1   

       Max_length_of_stay_leaders  
Unit1                           4  
Unit2                           2  
Unit3                           2  
Unit4                           1  
Unit5                           1 

Explanation

temp = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount()+1, axis=1)

This gives you, for each row, a running count within each run of consecutive identical leader names:

       Leader_Jan  Leader_Feb  Leader_Mar  Leader_Apr
Unit1           1           2           3           4
Unit2           1           2           1           1
Unit3           1           1           1           2
Unit4           1           1           1           1
Unit5           1           1           1           1
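To make the one-liner easier to follow, here is a sketch that breaks it into its steps for a single row (Unit2). The intermediate names starts, run_id and position are illustrative and do not appear in the answer's code.

```python
import numpy as np
import pandas as pd

unit2 = pd.Series(
    ["Lena", "Lena", np.nan, "Lena"],
    index=["Leader_Jan", "Leader_Feb", "Leader_Mar", "Leader_Apr"],
    dtype=object,
)

# True wherever the value differs from the previous month, i.e. a new run starts.
# NaN never equals anything, so a NaN month always starts a run of its own.
starts = unit2 != unit2.shift()

# The cumulative sum of the start flags gives every run its own id.
run_id = starts.cumsum()

# cumcount numbers the entries inside each run from 0, hence the +1.
position = unit2.groupby(run_id).cumcount() + 1

print(starts.tolist())    # [True, False, True, True]
print(run_id.tolist())    # [1, 1, 2, 3]
print(position.tolist())  # [1, 2, 1, 1]
```

The last list is the Unit2 row of the table above. Note that the NaN month also receives a count of 1, so it is treated like a one-month stay at this stage.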

Then simply take the max and min:

mins = temp.min(1)
maxs = temp.max(1)

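The problem then is Nina: she was in office the whole time, so in that case min must also be 4.

So, for this edge case only, the mask object detects rows whose counts are strictly monotonically increasing and replaces min with max in those rows.

I am still not sure whether it works in all cases, so please check.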
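One case where the running-count approach appears to fall short is a row such as A, A, B, B: the running counts are 1, 2, 1, 2, which is not strictly increasing, so min stays at 1 although both stays last two months. A possible variant, sketched below, takes the size of each run instead of the running count and drops the NaN months before counting. The helper names stay_lengths and min_max_stay are hypothetical and not part of the answer.

```python
import numpy as np
import pandas as pd

def stay_lengths(row):
    """Length of every uninterrupted run of a leader name in one row."""
    run_id = (row != row.shift()).cumsum()  # new id whenever the value changes
    named = row.notna()                     # ignore months without a leader
    return row[named].groupby(run_id[named]).size()

def min_max_stay(row):
    lengths = stay_lengths(row)
    return pd.Series({
        "Min_length_of_stay_leaders": lengths.min(),
        "Max_length_of_stay_leaders": lengths.max(),
    })

data = pd.DataFrame(
    [["Nina", "Nina", "Nina", "Nina"],
     ["Lena", "Lena", np.nan, "Lena"],
     ["Alex", "Maria", "Alex", "Alex"],
     ["Emilia", np.nan, np.nan, np.nan],
     [np.nan, "Corinna", "Petra", np.nan]],
    columns=["Leader_Jan", "Leader_Feb", "Leader_Mar", "Leader_Apr"],
    index=["Unit1", "Unit2", "Unit3", "Unit4", "Unit5"],
    dtype=object,
)

result = data.join(data.apply(min_max_stay, axis=1))
print(result[["Min_length_of_stay_leaders", "Max_length_of_stay_leaders"]])

# The counterexample from the paragraph above: two stays of two months each.
two_pairs = pd.Series(["A", "A", "B", "B"], dtype=object)
print(stay_lengths(two_pairs).tolist())  # [2, 2]
```

On the question's data this gives minimums of 4, 1, 1, 1, 1 and maximums of 4, 2, 2, 1, 1, matching the expected output. A row without any leader would yield NaN for both columns, which may or may not be the desired behaviour.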