Question

I have the following dataset in a CSV file. The columns are {user_id, movie_id, rating}.

I need to create a 2-D user_movie rating array from this data. It should look like this, with user_id as rows and movie_id as columns:

X 1 2 3 4 5
1 5 2 3 0 4
2 0 1 3 0 5

I have loaded the CSV data into a DataFrame. Is there a direct way to do this in pandas, or should I iterate and build the 2-D array myself?

I tried the following code:

def data_preprocess(data_file):
    r_cols = ['user_id', 'movie_id', 'rating']
    user_ratings_file = pd.read_csv(data_file, sep='\t', names=r_cols)
    user_ratings_file = user_ratings_file.pivot(index='user_id', columns='movie_id', values='rating').fillna(0).astype(int).reindex(
    columns=np.arange(1, 6), fill_value=0)
    print (user_ratings_file)
    return user_ratings_file

and I am getting:

movie_id  1  2  3  4  5
user_id                
1 1 5     0  0  0  0  0
1 2 2     0  0  0  0  0
1 3 3     0  0  0  0  0
1 5 4     0  0  0  0  0
2 2 1     0  0  0  0  0
2 3 3     0  0  0  0  0

And print (user_ratings_file.pivot(index='user_id', columns='movie_id', values='rating')) gives me:

   movie_id  NaN
    user_id      
    1 1 5     NaN
    1 2 2     NaN
    1 3 3     NaN
    1 5 4     NaN
    2 2 1     NaN
    2 3 3     NaN

Answer 1
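
Before reshaping, it may be worth checking how the file was parsed. In the output shown in the question, each index label looks like a whole line (`1 1 5`), which suggests that `sep='\t'` did not split the rows and that the file is whitespace-separated rather than tab-separated. That is an inference from the printed output, not something the question states. A minimal sketch, using an in-memory string as a hypothetical stand-in for the file:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the data file: whitespace-separated, no header row.
raw = "1 1 5\n1 2 2\n1 3 3\n1 5 4\n2 2 1\n2 3 3\n"
r_cols = ['user_id', 'movie_id', 'rating']

# With sep='\t' each line is read as one field, so movie_id and rating are all NaN.
bad = pd.read_csv(io.StringIO(raw), sep='\t', names=r_cols)

# sep=r'\s+' splits on any run of whitespace (spaces or tabs).
good = pd.read_csv(io.StringIO(raw), sep=r'\s+', names=r_cols)

# Same reshaping chain as in the question, applied to the correctly parsed frame.
ratings = (good.pivot(index='user_id', columns='movie_id', values='rating')
               .fillna(0)
               .astype(int)
               .reindex(columns=np.arange(1, 6), fill_value=0))
print(ratings)
```

If the file really is comma- or tab-separated, the same idea applies with the matching `sep` value.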

You need pivot with reindex:

import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 1],
                   'rating': [3, 2, 5, 1, 3, 4],
                   'movie_id': [3, 2, 1, 2, 3, 5]})

# Parentheses are required so the chained calls can span several lines
df = (df.pivot(index='user_id', columns='movie_id', values='rating')
        .fillna(0)
        .astype(int)
        .reindex(columns=np.arange(1, 6), fill_value=0))

print (df)
movie_id  1  2  3  4  5
user_id                
1         5  2  3  0  4
2         0  1  3  0  0

Another solution with unstack:

# Starting again from the original long-format df
df = (df.set_index(['user_id', 'movie_id'])['rating']
        .unstack(fill_value=0)
        .reindex(columns=np.arange(1, 6), fill_value=0))
print (df)
movie_id  1  2  3  4  5
user_id                
1         5  2  3  0  4
2         0  1  3  0  0

But if you get:

ValueError: Index contains duplicate entries, cannot reshape

you need to aggregate the duplicates:

print (df)
   user_id  movie_id  rating
0        1         3       3
1        1         2       2
2        1         1       5
3        2         2       1
4        2         3       3  <-duplicates for 2,3
5        2         3       8  <-duplicates for 2,3
6        1         5       4

df = (df.groupby(['user_id', 'movie_id'])['rating']
        .mean()
        .unstack(fill_value=0)
        .reindex(columns=np.arange(1, 6), fill_value=0))
print (df)
movie_id    1    2    3  4    5
user_id                        
1         5.0  2.0  3.0  0  4.0
2         0.0  1.0  5.5  0  0.0

Or use pivot_table with aggfunc.
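
The `pivot_table` variant mentioned above could look roughly like the following. This is a sketch rather than the answer's original code: it reuses the sample data with the duplicated (2, 3) pair and assumes `aggfunc='mean'`, to match the `groupby(...).mean()` approach. The final `to_numpy()` call is an addition that turns the table into a plain 2-D array.

```python
import numpy as np
import pandas as pd

# Sample data with a duplicated (user_id=2, movie_id=3) pair.
df = pd.DataFrame({'user_id':  [1, 1, 1, 2, 2, 2, 1],
                   'movie_id': [3, 2, 1, 2, 3, 3, 5],
                   'rating':   [3, 2, 5, 1, 3, 8, 4]})

# pivot_table aggregates duplicates with aggfunc instead of raising ValueError.
table = (df.pivot_table(index='user_id', columns='movie_id',
                        values='rating', aggfunc='mean', fill_value=0)
           .reindex(columns=np.arange(1, 6), fill_value=0))
print(table)

# Plain 2-D NumPy array: rows are users, columns are movies 1..5.
matrix = table.to_numpy()
```

Other aggregations, such as `'max'` or `'last'`, may suit the data better if averaging duplicate ratings is not meaningful.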

Aggregating data from a CSV into a 2-D array using pandas
