I have a pandas DataFrame with a column of user IDs that are each about 40 characters long. To save space, I want to replace each distinct user ID with an integer, starting from 0.
What I have:
userID itemID
------------------
3a r5
3a r6
4b r5
4c r6
What I need:
userID itemID
------------------
0 r5
0 r6
1 r5
2 r6
Answer 0 (score: 3)
Use pd.factorize():
In [145]: df
Out[145]:
userID itemID
0 3a r5
1 3a r6
2 4b r5
3 4c r6
In [146]: df.userID = pd.factorize(df.userID)[0]
In [147]: df
Out[147]:
userID itemID
0 0 r5
1 0 r6
2 1 r5
3 2 r6
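For reference, here is the same idea as a self-contained script rather than an IPython session. It is only a sketch: the frame is rebuilt from the sample data in the question, and the id_lookup name is made up for illustration. It also keeps the second value returned by pd.factorize(), which the one-liner above discards, so the original IDs can be recovered later.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "userID": ["3a", "3a", "4b", "4c"],
    "itemID": ["r5", "r6", "r5", "r6"],
})

# factorize returns (codes, uniques): codes are integers numbered
# from 0 in order of first appearance, uniques holds the original values
codes, uniques = pd.factorize(df["userID"])
df["userID"] = codes

# Hypothetical helper name: maps each integer code back to its original ID
id_lookup = dict(enumerate(uniques))

print(df)
print(id_lookup)
```

If the mapping is not stored somewhere, the original IDs cannot be restored from the integer column alone.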
If your main goal is to save memory, you can instead convert the column to the category dtype:
In [155]: df = pd.concat([df] * 5, ignore_index=True)
In [156]: df
Out[156]:
userID itemID
0 3a r5
1 3a r6
2 4b r5
3 4c r6
4 3a r5
5 3a r6
6 4b r5
7 4c r6
8 3a r5
9 3a r6
10 4b r5
11 4c r6
12 3a r5
13 3a r6
14 4b r5
15 4c r6
16 3a r5
17 3a r6
18 4b r5
19 4c r6
In [157]: df.memory_usage()
Out[157]:
Index 80
userID 160
itemID 160
dtype: int64
Categorizing userID:
In [158]: df.userID = df.userID.astype('category')
In [159]: df.memory_usage()
Out[159]:
Index 80
userID 44 # <-- NOTE: down from 160
itemID 160
dtype: int64
In [160]: df
Out[160]:
userID itemID
0 3a r5
1 3a r6
2 4b r5
3 4c r6
4 3a r5
5 3a r6
6 4b r5
7 4c r6
8 3a r5
9 3a r6
10 4b r5
11 4c r6
12 3a r5
13 3a r6
14 4b r5
15 4c r6
16 3a r5
17 3a r6
18 4b r5
19 4c r6
In [161]: df.dtypes
Out[161]:
userID category
itemID object
dtype: object
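The category approach can also be sketched as a standalone script. The 40-character IDs below are invented to resemble the ones described in the question, and the row count is arbitrary. The sketch passes deep=True to memory_usage() so that the string contents are counted, not just the 8-byte pointers, and it reads the underlying integers through .cat.codes.

```python
import pandas as pd

# Made-up 40-character IDs standing in for the real ones
ids = ["3a" * 20, "4b" * 20, "4c" * 20]
df = pd.DataFrame({
    "userID": [ids[0], ids[0], ids[1], ids[2]] * 250,
    "itemID": ["r5", "r6", "r5", "r6"] * 250,
})

# deep=True measures the actual string storage, not only the references
before = df["userID"].memory_usage(index=False, deep=True)

df["userID"] = df["userID"].astype("category")
after = df["userID"].memory_usage(index=False, deep=True)

# The column still displays the original IDs; the integers live in .cat.codes
codes = df["userID"].cat.codes

print(before, after)
print(codes.head(4).tolist())
```

One difference from pd.factorize(): with astype('category') the categories of sortable values are sorted, so the codes follow sorted order rather than order of first appearance. For the sample IDs the two orders happen to coincide.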