Generate a unique ID based on duplicate rows

Time: 2020-09-09 05:07:37

Tags: pandas python-2.7

I have a dataframe like this:

ID,SUBJECT_CODE,SUBJECT_GROUP,CLASS_ID,CAMPUS_ID
1,g1,VP2K,c1,r1
2,g1,VP2K,c1,r1
3,g1,VP3K,c2,r2
4,g1,VP3K,c2,r2
5,g1,VP3K,c3,r3

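I have to maintain a column CORR_ID whose value is a unique UUID (uuid.uuid4().int) for every unique row, and the same UUID for all rows that are duplicates of each other. Rows are considered duplicates if they have the same CLASS_ID and CAMPUS_ID (subset=['CLASS_ID','CAMPUS_ID']).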
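To make the duplicate criterion concrete, here is a minimal sketch that rebuilds the sample frame above and marks which rows share a (CLASS_ID, CAMPUS_ID) pair. The variable names `keys`, `is_dup` and `group_no` are only illustrative and are not part of the original data.

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "SUBJECT_CODE": ["g1"] * 5,
    "SUBJECT_GROUP": ["VP2K", "VP2K", "VP3K", "VP3K", "VP3K"],
    "CLASS_ID": ["c1", "c1", "c2", "c2", "c3"],
    "CAMPUS_ID": ["r1", "r1", "r2", "r2", "r3"],
})

keys = ["CLASS_ID", "CAMPUS_ID"]

# True for every row whose key pair occurs more than once
is_dup = df.duplicated(subset=keys, keep=False)

# Integer label per distinct key pair, numbered in order of first appearance
group_no = df.groupby(keys, sort=False).ngroup()

print(is_dup.tolist())
print(group_no.tolist())
```

Every row that gets the same group label should end up with the same CORR_ID.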

Expected result:

ID,SUBJECT_CODE,SUBJECT_GROUP,CLASS_ID,CAMPUS_ID,CORR_ID
1,g1,VP2K,c1,r1,142313746482664936587190810281013480411   //notice that the uuid of both 1st and 3rd rows are same, as both have same ['CLASS_ID','CAMPUS_ID']. Similarly for the 2nd and 4th rows.
2,g1,VP3K,c2,r2,342313743483664636887990810281013450392
3,g1,VP2K,c1,r1,142313746482664936587190810281013480411
4,g1,VP3K,c2,r2,342313743483664636887990810281013450392
5,g1,VP3K,c3,r3,247313743481654636887998810278015678903

So I would like to know whether there is a Pythonic way to do this. Any help would be appreciated. Thanks.

1 Answer:

Answer 0: (score: 0)

For me, the problem was saving such large integers to a pandas column, which raised an OverflowError. A possible solution is to convert the values to Decimal:

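import uuid
from decimal import Decimal

# One fresh UUID per (CLASS_ID, CAMPUS_ID) group, broadcast to all rows of the group.
# Decimal keeps the 128-bit value in an object column instead of overflowing int64.
f = lambda x: Decimal(uuid.uuid4().int)
df['CORR_ID'] = df.groupby(['CLASS_ID','CAMPUS_ID'])['CLASS_ID'].transform(f)
print (df)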
   ID SUBJECT_CODE SUBJECT_GROUP CLASS_ID CAMPUS_ID  \
0   1           g1          VP2K       c1        r1   
1   2           g1          VP2K       c1        r1   
2   3           g1          VP3K       c2        r2   
3   4           g1          VP3K       c2        r2   
4   5           g1          VP3K       c3        r3   

                                   CORR_ID  
0  169638083186337734039542386251361973037  
1  169638083186337734039542386251361973037  
2  279310814212899708123352457215494669311  
3  279310814212899708123352457215494669311  
4  187655807105121612884740725825459107251
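
As a variation on the same idea, the sketch below assigns one UUID per distinct key pair and merges it back onto the frame. It stores the UUID as a string rather than a Decimal, which is an assumption about what the downstream code can accept. A UUID is a 128-bit value, so its integer form does not fit into an int64 column. The names `keys`, `unique_keys` and `result` are hypothetical.

```python
import uuid

import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "SUBJECT_CODE": ["g1"] * 5,
    "SUBJECT_GROUP": ["VP2K", "VP2K", "VP3K", "VP3K", "VP3K"],
    "CLASS_ID": ["c1", "c1", "c2", "c2", "c3"],
    "CAMPUS_ID": ["r1", "r1", "r2", "r2", "r3"],
})

keys = ["CLASS_ID", "CAMPUS_ID"]

# One row per distinct key pair, each with its own UUID stored as a digit string
unique_keys = df[keys].drop_duplicates().reset_index(drop=True)
unique_keys["CORR_ID"] = [str(uuid.uuid4().int) for _ in range(len(unique_keys))]

# Left merge attaches the UUID to every row and keeps the row order of df
result = df.merge(unique_keys, on=keys, how="left")
print(result)
```

The values are random, so the printed CORR_ID column differs on every run; only the grouping (rows 1 and 2 equal, rows 3 and 4 equal) is stable.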