Question

I have two DataFrames: one holds user-item ratings and the other holds side information about the items:

#df1
A12VH45Q3H5R5I B000NWJTKW 5.0
A3J8AQWNNI3WSN B000NWJTKW 4.0
A1XOBWIL4MILVM BDASK99000 1.0

#df2
B000NWJTKW ....
BDASK99000 ....

Now I want to map the item and user names to integer IDs. I know that factorize offers one way to do this:

df.apply(lambda x: pd.factorize(x)[0] + 1)

But I want to make sure the item integers are consistent across both DataFrames, so that the resulting DataFrames are:

#df1
1       1      5.0
2       1      4.0
3       2      1.0

#df2
1      ...
2      ...

Do you know how to ensure this? Thanks in advance!

Answer 1

Concatenate the common column and apply pd.factorize (or pd.Categorical) to it:

codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1

For example,

import pandas as pd

df1 = pd.DataFrame(
[('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0),
 ('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0),
 ('A1XOBWIL4MILVM', 'BDASK99000', 1.0)], columns=['user', 'item', 'rating'])

df2 = pd.DataFrame(
[('B000NWJTKW', 10),
 ('BDASK99000', 20)], columns=['item', 'extra'])

codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1

codes, uniques = pd.factorize(df1['user'])
df1['user'] = codes + 1

print(df1)
print(df2)

yields

# df1
   user  item  rating
0     1     1     5.0
1     2     1     4.0
2     3     2     1.0

# df2
   item  extra
0     1     10
1     2     20
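
The pd.Categorical variant mentioned above is not spelled out, so here is a minimal sketch of how it might look. It assumes the same df1 and df2 as in the example, and the item_id column name is only illustrative. The idea is to build one shared category list and encode both frames against it:

```python
import pandas as pd

df1 = pd.DataFrame(
    [('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0),
     ('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0),
     ('A1XOBWIL4MILVM', 'BDASK99000', 1.0)], columns=['user', 'item', 'rating'])
df2 = pd.DataFrame(
    [('B000NWJTKW', 10),
     ('BDASK99000', 20)], columns=['item', 'extra'])

# Shared vocabulary: every distinct item seen in either frame,
# in order of first appearance.
categories = pd.concat([df1['item'], df2['item']]).unique()

# Encode each frame against the same category list.
# .codes is 0-based, so add 1 to get IDs starting at 1.
# 'item_id' is an illustrative column name, not part of the original answer.
df1['item_id'] = pd.Categorical(df1['item'], categories=categories).codes + 1
df2['item_id'] = pd.Categorical(df2['item'], categories=categories).codes + 1
```

Because both columns are encoded against the same categories, the same item string gets the same integer in either frame, regardless of row order.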

Another way to solve the problem (if you have enough memory) is to merge the two DataFrames with df3 = pd.merge(df1, df2, on='item', how='outer') and then factorize df3['item']:

df3 = pd.merge(df1, df2, on='item', how='outer')
for col in ['item', 'user']:
    df3[col] = pd.factorize(df3[col])[0] + 1
print(df3)

yields

   user  item  rating  extra
0     1     1     5.0     10
1     2     1     4.0     10
2     3     2     1.0     20
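
One thing to watch with the merge approach: if df2 contains an item that nobody has rated, the outer merge leaves user as NaN on that row, and pd.factorize encodes missing values as -1 by default, so after adding 1 that row gets user ID 0. A small sketch of that case, where 'B00UNRATED' is a made-up item used only for illustration:

```python
import pandas as pd

df1 = pd.DataFrame(
    [('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0)], columns=['user', 'item', 'rating'])
# 'B00UNRATED' is a hypothetical item that no user has rated.
df2 = pd.DataFrame(
    [('B000NWJTKW', 10),
     ('B00UNRATED', 30)], columns=['item', 'extra'])

df3 = pd.merge(df1, df2, on='item', how='outer')

# The unrated item has no matching row in df1, so its user is NaN.
# factorize maps NaN to -1 by default, which becomes 0 after the +1 shift.
user_codes, _ = pd.factorize(df3['user'])
df3['user'] = user_codes + 1
```

So with 1-based IDs, a 0 in the user column marks rows that came only from df2.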

Answer 2

Another option is to apply the factorization to the first DataFrame and then apply the resulting mapping to the second DataFrame:

# create factorization:
idx, levels = pd.factorize(df1['item'])

# replace the item codes in the first dataframe with the new index value
df1['item'] = idx

# create a dictionary mapping the original code to the new index value
d = {code: i for i, code in enumerate(levels)}

# apply this mapping to the second dataframe
df2['item'] = df2.item.apply(lambda code: d[code])

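This method only works if every level is present in both DataFrames. In particular, an item in df2 that does not appear in df1 has no entry in the dictionary, so the lookup d[code] raises a KeyError.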
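
If df2 might contain items that df1 lacks, one possible workaround is to look the values up in the levels index directly instead of going through a dictionary. This is only a sketch under that assumption, and 'B00UNRATED' is a made-up item for illustration. Index.get_indexer returns -1 for values it cannot find:

```python
import pandas as pd

df1 = pd.DataFrame({'item': ['B000NWJTKW', 'B000NWJTKW', 'BDASK99000']})
# 'B00UNRATED' is a hypothetical item that is absent from df1.
df2 = pd.DataFrame({'item': ['BDASK99000', 'B00UNRATED']})

# Factorize the first frame; levels is a pandas Index of the unique items.
idx, levels = pd.factorize(df1['item'])
df1['item'] = idx

# get_indexer returns each value's position in levels, or -1 if it is absent,
# so unknown items are flagged instead of raising a KeyError.
df2['item'] = levels.get_indexer(df2['item'])
```

The -1 entries can then be filtered out or handled separately, depending on what should happen to unknown items.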

How do I factorize two DataFrames with python-pandas?
