我有这个镜头数据帧。它有列来分类电影所属的流派。类型类别是行中具有二进制值的列名称。如果电影属于某种类型,则在适当的列下方为1,否则为0。我想计算python pandas中每个用户的每个类型的平均评分。
# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('C:/Users/End-User/Desktop/ml-100k/u.user',
sep='|',names=u_cols, encoding='latin-1')
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('C:/Users/End-User/Desktop/ml-100k/u.data',
sep='\t', names=r_cols, encoding='latin-1')
# Reading item file:
m_cols = ['movie_id', 'title' ,'release_date','video_release_date', 'imdb_url',
'unknown', 'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy',
'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
'Musical', 'Mystery', 'Romance', 'Sci-Fi','Thriller', 'War', 'Western']
movies = pd.read_csv('C:/Users/End-User/Desktop/ml-100k/u.item',
sep='|', names=m_cols, encoding='latin-1')
# create one merged DataFrame
movie_ratings = pd.merge(movies, ratings)
lens = pd.merge(movie_ratings, users)
# I have tried this but don't know how to get the average of the ratings for each user.
df = pd.pivot_table(lens, index = ['user_id'],
columns = ['unknown', 'Action', 'Adventure', 'Animation',
'Children\'s', 'Comedy', 'Crime', 'Documentary',
'Drama', 'Fantasy', 'Film-Noir', 'Horror',
'Musical', 'Mystery', 'Romance', 'Sci-Fi',
'Thriller', 'War', 'Western'],
values = ['rating'])
print df
答案 0 :(得分:0)
考虑从宽到长重塑数据框以创建流派列,然后使用其aggfunc
参数通过pivot_table()
运行结果,特别是numpy mean:
import pandas as pd
import numpy as np
#...same code...
lens = pd.merge(movie_ratings, users)
genrecols = ['unknown', 'Action', 'Adventure', 'Animation',
'Children\'s', 'Comedy', 'Crime', 'Documentary',
'Drama', 'Fantasy', 'Film-Noir', 'Horror',
'Musical', 'Mystery', 'Romance', 'Sci-Fi',
'Thriller', 'War', 'Western']
# RESHAPE DF BY MELTING (WIDE TO LONG), SELECTING ONLY NEEDED FIELDS
mdf = pd.melt(lens[['user_id', 'sex', 'rating'] + genrecols],
id_vars=['user_id', 'sex', 'rating'], var_name='genre')
# FILTER FOR VALUE = 1 AND THREE NEEDED COLUMNS
mdf = mdf[mdf['value']==1][['user_id', 'sex', 'rating', 'genre']]
# RUN PIVOTED AGGREGATION
df = pd.pivot_table(mdf, columns = ['genre'], index = ['user_id', 'sex'],
values = ['rating'], aggfunc = np.mean)
print df