I have a dataframe with a client_id column that I want to anonymize, with no possibility of reversing it.
I want to drop client_id and add a new column that holds the same value for every row belonging to the same client.
import pandas as pd
df = pd.DataFrame({
'client_id':[111, 222, 111, 222, 333, 222, 111, 333],
'date':['2018-08-20', '2018-08-22', '2018-08-21', '2018-08-21', '2018-08-18', '2018-08-20', '2018-08-18', '2018-08-19'],
'action':['test1', 'test2', 'test3', 'test4', 'test5', 'test6', 'test7', 'test8']
})
My dataframe:
client_id | date | action |
-----------------------------
111 | '2018-08-20'| test1 |
222 | '2018-08-22'| test2 |
111 | '2018-08-21'| test3 |
222 | '2018-08-21'| test4 |
333 | '2018-08-18'| test5 |
222 | '2018-08-20'| test6 |
111 | '2018-08-18'| test7 |
333 | '2018-08-19'| test8 |
The result expected:
id | date | action |
-----------------------------
1 | '2018-08-20'| test1 |
2 | '2018-08-22'| test2 |
1 | '2018-08-21'| test3 |
2 | '2018-08-21'| test4 |
3 | '2018-08-18'| test5 |
2 | '2018-08-20'| test6 |
1 | '2018-08-18'| test7 |
3 | '2018-08-19'| test8 |
I tried pandas.core.groupby.DataFrameGroupBy.rank, but it did not produce the expected result:
df['id']= df.groupby("client_id")["date"].rank(ascending=True)
Answer 0 (score: 3)
pandas.factorize
df.assign(client_id=df.client_id.factorize()[0] + 1)
action client_id date
0 test1 1 2018-08-20
1 test2 2 2018-08-22
2 test3 1 2018-08-21
3 test4 2 2018-08-21
4 test5 3 2018-08-18
5 test6 2 2018-08-20
6 test7 1 2018-08-18
7 test8 3 2018-08-19
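The call above overwrites client_id in place, while the question asks for a new id column with client_id removed. A minimal sketch of that variant, assuming the sample dataframe from the question (the name `anonymized` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'client_id': [111, 222, 111, 222, 333, 222, 111, 333],
    'date': ['2018-08-20', '2018-08-22', '2018-08-21', '2018-08-21',
             '2018-08-18', '2018-08-20', '2018-08-18', '2018-08-19'],
    'action': ['test1', 'test2', 'test3', 'test4',
               'test5', 'test6', 'test7', 'test8'],
})

# factorize returns (codes, uniques); codes are 0-based and numbered
# in order of first appearance, so add 1 to start the ids at 1.
codes, uniques = pd.factorize(df['client_id'])

# Add the new id column, drop the original one, and put id first.
anonymized = df.assign(id=codes + 1).drop(columns='client_id')
anonymized = anonymized[['id', 'date', 'action']]
print(anonymized)
```

The mapping back to the original values lives only in `uniques`; if that array is discarded and not stored alongside the data, the id column alone does not carry the original client_id values.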
numpy.unique
import numpy as np
df.assign(client_id=np.unique(df.client_id, return_inverse=True)[1] + 1)
action client_id date
0 test1 1 2018-08-20
1 test2 2 2018-08-22
2 test3 1 2018-08-21
3 test4 2 2018-08-21
4 test5 3 2018-08-18
5 test6 2 2018-08-20
6 test7 1 2018-08-18
7 test8 3 2018-08-19
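Both approaches give the same labels on the sample data only because 111, 222 and 333 happen to appear first in ascending order. In general, factorize numbers values by first appearance, while numpy.unique numbers them by sorted order. A small sketch with a made-up series (the values in `ids` are hypothetical) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical ids whose order of appearance differs from sorted order.
ids = pd.Series([333, 111, 333, 222])

# Labels assigned in order of first appearance.
by_appearance = pd.factorize(ids)[0] + 1

# Labels assigned by position in the sorted unique values.
by_sorted = np.unique(ids.to_numpy(), return_inverse=True)[1] + 1

print(by_appearance)  # [1 2 1 3]
print(by_sorted)      # [3 1 3 2]
```

With numpy.unique the new labels keep the relative ordering of the original ids, which may matter if that ordering itself is considered sensitive.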