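I am building a movie recommendation system. I need Python code that converts data imported from an Excel worksheet into a set format (shown below).
Code that imports the data from the Excel worksheet: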
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('project.xlsx', sheet_name='Sheet1')
df.head(40)
The output I get:
USER MOVIE RATINGS
0 Julia Roberts Shrek 2.5
1 NaN V for Vendetta 3.5
2 NaN Pretty Woman 3.0
3 NaN Star Wars 3.5
4 NaN While You Were Sleeping 2.5
5 NaN Phone Booth 3.0
6 Drew Barrymore Shrek 3.0
7 NaN V for Vendetta 3.5
8 NaN Pretty Woman 1.5
9 NaN Star Wars 5.0
10 NaN Phone Booth 3.0
11 NaN While You Were Sleeping 3.5
12 Kate Winslet Shrek 2.5
13 NaN V for Vendetta 3.0
14 NaN Star Wars 3.5
15 NaN Phone Booth 4.0
16 Tom Hanks While You Were Sleeping 2.5
17 NaN V for Vendetta 3.5
18 NaN Pretty Woman 3.0
19 NaN Star Wars 4.0
20 NaN Phone Booth 4.5
....
......
......
......
From this, I need output like the following:
dataset={
'Julia Roberts': {
'Shrek': 2.5,
'I am Legend':3.0,
'V for Vendetta': 3.5,
'Pretty Woman': 0,
"My Sister's Keeper":5.0,
'Star Wars': 3.5,
'Me Before You': 3.0,
'While You Were Sleeping': 2.5,
'Phone Booth': 3.0},
'Drew Barrymore': {'Shrek': 3.0,
'V for Vendetta': 3.5,
'Pretty Woman': 1.5,
"My Sister's Keeper":4.0,
'Star Wars': 5.0,
'Phone Booth': 3.0,
'While You Were Sleeping': 3.5},
'Tom Hanks': {'V for Vendetta': 3.5,
'Pretty Woman': 3.0,
'Phone Booth': 4.5,
'Star Wars': 4.0,
'While You Were Sleeping': 2.5,
'I am Legend':3.5},
'Sandra Bullock': {'Shrek': 3.0,
'V for Vendetta': 4.0,
'Pretty Woman': 2.0,
'Star Wars': 3.0,
'I am Legend':4.5,
"My Sister's Keeper":3.5,
'Phone Booth': 3.0,
'While You Were Sleeping': 2.0}
}
The code I am using (but it raises an error):
max_nb_row = 0
for sheet in df.sheets():
    max_nb_row = max(max_nb_row, sheet.nrows)
for row in range(max_nb_row):
    for sheet in df.sheets():
        if row < sheet.nrows:
            print(sheet.row(row))
Answer 0 (score: 0)
You can use this hard-to-read one-liner:
df.ffill().groupby('user').apply(lambda x: dict(zip(x['movie'], x['ratings']))).to_dict()
To visualize what is happening, we will use this smaller dataframe:
>>> df
user movie ratings
0 Julia Roberts Shrek 2.5
1 NaN V for Vendetta 3.5
2 NaN Pretty Woman 3.0
3 Drew Barrymore Shrek 3.0
4 NaN V for Vendetta 3.5
Step by step, this is what happens:
Use ffill to replace the NaN values in the user column with the name above them.
user movie ratings
0 Julia Roberts Shrek 2.5
1 Julia Roberts V for Vendetta 3.5
2 Julia Roberts Pretty Woman 3.0
3 Drew Barrymore Shrek 3.0
4 Drew Barrymore V for Vendetta 3.5
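The forward fill can be tried without the Excel file. A minimal sketch, assuming the small dataframe above is typed in by hand, with None standing in for the blank cells:

```python
import pandas as pd

# Hand-built copy of the small example frame. None marks the blank cells
# that read_excel returns as NaN.
df = pd.DataFrame({
    "user": ["Julia Roberts", None, None, "Drew Barrymore", None],
    "movie": ["Shrek", "V for Vendetta", "Pretty Woman", "Shrek", "V for Vendetta"],
    "ratings": [2.5, 3.5, 3.0, 3.0, 3.5],
})

# Forward fill: each missing user takes the last non-missing name above it.
filled = df.ffill()
print(filled)
```

Note that df.ffill() fills every column, so a blank rating would also inherit the value above it. If that matters, df['user'] = df['user'].ffill() restricts the fill to the user column.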
Use groupby('user') to group the data by user.
Use apply(lambda x: dict(zip(x['movie'], x['ratings']))) to create a dict of {movie: rating} pairs for each user.
user
Drew Barrymore {'Shrek': 3.0, 'V for Vendetta': 3.5}
Julia Roberts {'Shrek': 2.5, 'V for Vendetta': 3.5, 'Pretty ...
dtype: object
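The intermediate Series of dicts can be reproduced on its own. A minimal sketch, assuming the already-filled small frame; selecting the two columns before apply is an addition here, made because recent pandas releases warn when the lambda is also handed the grouping column:

```python
import pandas as pd

# The small example frame after the forward fill.
filled = pd.DataFrame({
    "user": ["Julia Roberts"] * 3 + ["Drew Barrymore"] * 2,
    "movie": ["Shrek", "V for Vendetta", "Pretty Woman", "Shrek", "V for Vendetta"],
    "ratings": [2.5, 3.5, 3.0, 3.0, 3.5],
})

# Keep only the columns the lambda reads, then build one dict per user.
per_user = filled.groupby("user")[["movie", "ratings"]].apply(
    lambda g: dict(zip(g["movie"], g["ratings"]))
)
print(per_user)
```

The result is indexed by user name, so per_user['Drew Barrymore'] returns that user's {movie: rating} dict.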
Call to_dict() on the resulting Series to get the desired result.
{'Drew Barrymore': {'Shrek': 3.0, 'V for Vendetta': 3.5},
'Julia Roberts': {'Pretty Woman': 3.0, 'Shrek': 2.5, 'V for Vendetta': 3.5}}
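The one-liner uses lower-case column names, while the sheet in the question has USER, MOVIE and RATINGS. A minimal sketch of the same idea wrapped in a function; ratings_to_dict and its parameters are hypothetical names, and plain iteration over the groups is swapped in for apply:

```python
import pandas as pd


def ratings_to_dict(df, user_col="USER", movie_col="MOVIE", rating_col="RATINGS"):
    """Turn a long ratings table into {user: {movie: rating}}.

    Hypothetical helper, not part of pandas. The column names default to
    the upper-case headers shown in the question's sheet.
    """
    filled = df.copy()
    # Fill only the user column, so blank ratings are left untouched.
    filled[user_col] = filled[user_col].ffill()
    return {
        user: dict(zip(group[movie_col], group[rating_col]))
        for user, group in filled.groupby(user_col)
    }


# In the question the frame would come from the workbook, e.g.
#   df = pd.read_excel('project.xlsx', sheet_name='Sheet1')
# Here a few rows are typed in by hand so the sketch runs on its own.
sample = pd.DataFrame({
    "USER": ["Julia Roberts", None, None, "Drew Barrymore", None],
    "MOVIE": ["Shrek", "V for Vendetta", "Pretty Woman", "Shrek", "V for Vendetta"],
    "RATINGS": [2.5, 3.5, 3.0, 3.0, 3.5],
})
dataset = ratings_to_dict(sample)
print(dataset)
```

Only movies that actually appear in a user's rows end up in that user's dict; nothing is added for movies the user has not rated.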