Question

I have the following DataFrame, df_address, which contains student addresses:

student_id     address_type     Address            City
 1                R              6th street         MPLS
 1                P              10th street SE     Chicago
 1                E              10th street SE     Chicago
 2                P              Washington ST      Boston
 2                E              Essex St           NYC
 3                E              1040 Taft Blvd     Dallas
 4                R              24th street        NYC
 4                P              8th street SE      Chicago
 5                T              10 Riverside Ave   Boston
 6                               20th St            NYC

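Each student can have several address types:

R stands for "Residential", P for "Permanent", E for "Emergency", and T for "Temporary". address_type can also be blank.

I want to populate an "IsPrimaryAddress" column based on the following logic:

For a given student, if address_type R exists, "Yes" should be written in the IsPrimaryAddress column for the row with address_type "R", and "No" for that student_id's other address types.

If address_type R does not exist but P does, then IsPrimaryAddress = 'Yes' for 'P' and 'No' for the remaining types.

If neither P nor R exists but E does, then IsPrimaryAddress = 'Yes' for 'E'.

If none of P, R, or E exists but 'T' does, then IsPrimaryAddress = 'Yes' for 'T'.

The resulting DataFrame should look like this: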

student_id     address_type     Address            City      IsPrimaryAddress
 1                R              6th street         MPLS      Yes
 1                P              10th street SE     Chicago   No
 1                E              10th street SE     Chicago   No
 2                P              Washington ST      Boston    Yes
 2                E              Essex St           NYC       No
 3                E              1040 Taft Blvd     Dallas    Yes
 4                R              24th street        NYC       Yes
 4                P              8th street SE      Chicago   No
 5                T              10 Riverside Ave   Boston    Yes
 6                               20th St            NYC       Yes

How can I achieve this? I tried the rank and cumulative functions on address_type, but couldn't get them to work.

Answer 1

First, use pd.Categorical so that address_type can be sorted in a custom order (R, then P, E, T, and finally blank):

import pandas as pd

# df is the question's df_address; blank address types are assumed to be stored as ''
df.address_type=pd.Categorical(df.address_type,['R','P','E','T',''],ordered=True)

df=df.sort_values('address_type') # sort the rows by address-type priority
# after sorting, the first value in each student_id group is the one to mark as Yes
df['new']=(df.groupby('student_id').address_type.transform('first')==df.address_type).map({True:'Yes',False:'No'})
df=df.sort_index() # restore the original row order


   student_id address_type  new
0           1            R  Yes
1           1            P   No
2           1            E   No
3           2            P  Yes
4           2            E   No
5           3            E  Yes
6           4            R  Yes
7           4            P   No
8           5            T  Yes
9           6               Yes
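
For comparison, here is a minimal self-contained sketch of a variant that skips the sort. This is not the code from the answer above: it swaps the Categorical for a plain priority lookup. The `priority` dict and the variable names `rank` and `best` are made up for illustration, and blank address types are assumed to be stored as empty strings.

```python
import pandas as pd

# Sample data copied from the question (Address and City omitted for brevity).
# Assumption: a blank address_type is stored as an empty string.
df_address = pd.DataFrame({
    "student_id":   [1, 1, 1, 2, 2, 3, 4, 4, 5, 6],
    "address_type": ["R", "P", "E", "P", "E", "E", "R", "P", "T", ""],
})

# Hypothetical priority table: a lower number means a higher priority.
priority = {"R": 0, "P": 1, "E": 2, "T": 3, "": 4}

# Translate each address type into its numeric priority.
rank = df_address["address_type"].map(priority)

# Best (lowest) priority number available for each student, aligned to every row.
best = rank.groupby(df_address["student_id"]).transform("min")

# A row is primary when its own priority equals its student's best priority.
df_address["IsPrimaryAddress"] = (rank == best).map({True: "Yes", False: "No"})

print(df_address)
```

As with the Categorical approach, a student with two rows of the same top-priority type would get "Yes" on both. An address type missing from `priority` maps to NaN and is therefore marked "No".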
