How do I read data from an Excel sheet and output it in a set format?

Time: 2018-03-27 10:06:47

Tags: python excel pandas

I am building a movie recommendation system. I need Python code that converts the data imported from an Excel sheet into a set format (shown below).


Code that imports the data from the Excel sheet:

import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

df = pd.read_excel('project.xlsx', sheet_name='Sheet1')
df.head(40)

The output I get:

        USER       MOVIE    RATINGS
0   Julia Roberts   Shrek   2.5
1   NaN         V for Vendetta  3.5
2   NaN         Pretty Woman    3.0
3   NaN            Star Wars    3.5
4   NaN    While You Were Sleeping  2.5
5   NaN     Phone Booth 3.0
6   Drew Barrymore  Shrek   3.0
7   NaN       V for Vendetta    3.5
8   NaN     Pretty Woman    1.5
9   NaN        Star Wars    5.0
10  NaN      Phone Booth    3.0
11  NaN   While You Were Sleeping   3.5
12  Kate Winslet       Shrek    2.5
13  NaN       V for Vendetta    3.0
14  NaN        Star Wars    3.5
15  NaN       Phone Booth   4.0
16  Tom Hanks   While You Were Sleeping 2.5
17  NaN           V for Vendetta    3.5
18  NaN         Pretty Woman    3.0
19  NaN         Star Wars   4.0
20  NaN     Phone Booth 4.5
....
......
......
......


From this I need output like the following:

dataset={
 'Julia Roberts': {
 'Shrek': 2.5,
 'I am Legend':3.0,
 'V for Vendetta': 3.5,
 'Pretty Woman': 0,
 "My Sister's Keeper":5.0,
 'Star Wars': 3.5,
 'Me Before You': 3.0,
 'While You Were Sleeping': 2.5,
 'Phone Booth': 3.0},

 'Drew Barrymore': {'Shrek': 3.0,
 'V for Vendetta': 3.5,
 'Pretty Woman': 1.5,
 "My Sister's Keeper":4.0,
 'Star Wars': 5.0,
 'Phone Booth': 3.0,
 'While You Were Sleeping': 3.5},


 'Tom Hanks': {'V for Vendetta': 3.5,
 'Pretty Woman': 3.0,
 'Phone Booth': 4.5,
 'Star Wars': 4.0,
 'While You Were Sleeping': 2.5,
 'I am Legend':3.5},

 'Sandra Bullock': {'Shrek': 3.0,
 'V for Vendetta': 4.0,
 'Pretty Woman': 2.0,
 'Star Wars': 3.0,
 'I am Legend':4.5,
 "My Sister's Keeper":3.5, 
 'Phone Booth': 3.0,
 'While You Were Sleeping': 2.0}
}

The code I am using (it raises an error):

max_nb_row = 0
for sheet in df.sheets():
  max_nb_row = max(max_nb_row, sheet.nrows)

for row in range(max_nb_row) :
  for sheet in df.sheets() :
    if row < sheet.nrows :
      print (sheet.row(row))

1 answer:

Answer 0: (score: 0)

You can use this hard-to-read one-liner:

df.ffill().groupby('user').apply(lambda x: dict(zip(x['movie'], x['ratings']))).to_dict()

To see what is happening, we will use this smaller dataframe:

>>> df
             user           movie  ratings
0   Julia Roberts           Shrek      2.5
1             NaN  V for Vendetta      3.5
2             NaN    Pretty Woman      3.0
3  Drew Barrymore           Shrek      3.0
4             NaN  V for Vendetta      3.5

Step by step, this is what it does:

  1. Use ffill to replace the NaN values in the user column with the name above them.

                 user           movie  ratings
    0   Julia Roberts           Shrek      2.5
    1   Julia Roberts  V for Vendetta      3.5
    2   Julia Roberts    Pretty Woman      3.0
    3  Drew Barrymore           Shrek      3.0
    4  Drew Barrymore  V for Vendetta      3.5
    
  2. Use groupby('user') to group the data by user.

  3. Use apply(lambda x: dict(zip(x['movie'], x['ratings']))) to build a dictionary of {movie: rating} pairs for each user.

    user
    Drew Barrymore    {'Shrek': 3.0, 'V for Vendetta': 3.5}
    Julia Roberts     {'Shrek': 2.5, 'V for Vendetta': 3.5, 'Pretty ...
    dtype: object
    
  4. Call to_dict() on the resulting Series to get the desired result.

    {'Drew Barrymore': {'Shrek': 3.0, 'V for Vendetta': 3.5},
     'Julia Roberts': {'Pretty Woman': 3.0, 'Shrek': 2.5, 'V for Vendetta': 3.5}}
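
The steps above can also be written out more explicitly. The following is only a minimal sketch, assuming the upper-case column names from the question (USER, MOVIE, RATINGS) and a small hand-built frame standing in for the Excel sheet. The names `filled` and `dataset` are just illustrative. It forward-fills only the USER column and builds the nested dictionary with a dict comprehension instead of apply:

```python
import numpy as np
import pandas as pd

# Small stand-in for the Excel sheet (assumption: same column names as the question).
df = pd.DataFrame({
    "USER": ["Julia Roberts", np.nan, np.nan, "Drew Barrymore", np.nan],
    "MOVIE": ["Shrek", "V for Vendetta", "Pretty Woman", "Shrek", "V for Vendetta"],
    "RATINGS": [2.5, 3.5, 3.0, 3.0, 3.5],
})

# Forward-fill only the USER column, so gaps in other columns are left untouched.
filled = df.assign(USER=df["USER"].ffill())

# Build {user: {movie: rating}}; sort=False keeps users in order of first appearance.
dataset = {
    user: dict(zip(group["MOVIE"], group["RATINGS"]))
    for user, group in filled.groupby("USER", sort=False)
}
print(dataset)
```

With real data, `df` would come from pd.read_excel as in the question. Filling only USER is a design choice: a plain df.ffill() would also fill any missing rating with the rating from the row above, which is probably not wanted.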