I have a DataFrame like this:
ID,SUBJECT_CODE,SUBJECT_GROUP,CLASS_ID,CAMPUS_ID
1,g1,VP2K,c1,r1
2,g1,VP2K,c1,r1
3,g1,VP3K,c2,r2
4,g1,VP3K,c2,r2
5,g1,VP3K,c3,r3
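I need to maintain a CORR_ID column whose value is a unique UUID (uuid.uuid4().int) for each unique row, with duplicate rows sharing the same UUID. A row counts as a duplicate if it has the same CLASS_ID and CAMPUS_ID as another row (subset=['CLASS_ID','CAMPUS_ID']).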
Expected result:
ID,SUBJECT_CODE,SUBJECT_GROUP,CLASS_ID,CAMPUS_ID,CORR_ID
1,g1,VP2K,c1,r1,142313746482664936587190810281013480411 // note that the 1st and 3rd rows share the same UUID because they have the same ['CLASS_ID','CAMPUS_ID']; likewise the 2nd and 4th rows.
2,g1,VP3K,c2,r2,342313743483664636887990810281013450392
3,g1,VP2K,c1,r1,142313746482664936587190810281013480411
4,g1,VP3K,c2,r2,342313743483664636887990810281013450392
5,g1,VP3K,c3,r3,247313743481654636887998810278015678903
Is there a Pythonic way to do this? Any help would be appreciated. Thanks.
Answer 0: (score: 0)
For me, the problem was storing such large integers in a pandas column, which raised an OverflowError. One possible solution is to convert the values to Decimal:
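import uuid
from decimal import Decimal

# One UUID per group, wrapped in Decimal to avoid the OverflowError
f = lambda x: Decimal(uuid.uuid4().int)
# transform broadcasts each group's single value to all rows of that group
df['CORR_ID'] = df.groupby(['CLASS_ID','CAMPUS_ID'])['CLASS_ID'].transform(f)
print(df)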
ID SUBJECT_CODE SUBJECT_GROUP CLASS_ID CAMPUS_ID \
0 1 g1 VP2K c1 r1
1 2 g1 VP2K c1 r1
2 3 g1 VP3K c2 r2
3 4 g1 VP3K c2 r2
4 5 g1 VP3K c3 r3
CORR_ID
0 169638083186337734039542386251361973037
1 169638083186337734039542386251361973037
2 279310814212899708123352457215494669311
3 279310814212899708123352457215494669311
4 187655807105121612884740725825459107251
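
If Decimal is not needed downstream, another way to sidestep the overflow is to keep the UUID integer as a string. The following is only a rough sketch, not part of the answer above: it rebuilds the question's sample frame, and the names keys, group_no and ids are made up for illustration. It assumes a string CORR_ID is acceptable. groupby(...).ngroup() numbers each distinct (CLASS_ID, CAMPUS_ID) pair, and one UUID is then generated per group number and mapped back onto the rows.

```python
import uuid

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "SUBJECT_CODE": ["g1"] * 5,
    "SUBJECT_GROUP": ["VP2K", "VP2K", "VP3K", "VP3K", "VP3K"],
    "CLASS_ID": ["c1", "c1", "c2", "c2", "c3"],
    "CAMPUS_ID": ["r1", "r1", "r2", "r2", "r3"],
})

keys = ["CLASS_ID", "CAMPUS_ID"]  # columns that define a duplicate

# ngroup() labels every distinct key combination with an integer
group_no = df.groupby(keys, sort=False).ngroup()

# One UUID per group, stored as str (assumption: a string ID is acceptable)
ids = {g: str(uuid.uuid4().int) for g in group_no.unique()}

df["CORR_ID"] = group_no.map(ids)
print(df)
```

The UUIDs are random, so the printed values differ on every run; only the grouping (rows 0 and 1 equal, rows 2 and 3 equal, row 4 distinct) stays the same.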