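I'm new to Python and am trying to subset a data frame of user movie ratings, first by row totals and then by column totals. Filtering by column totals takes several hours to finish, so I was wondering whether you could offer some pointers for optimizing the code.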
import pandas as pd

data_cols = ['user_id','movie_id','rating']
data = pd.read_csv('netflix_data/TrainingRatings.txt', sep=',', names=data_cols)
utrain = (data.sort_values('user_id'))
print(utrain.tail())
Movie_Ratings = utrain.pivot_table(index = ['user_id'],columns = ['movie_id'], values = ['rating'], aggfunc = lambda x:x)
Movie_Ratings.head()
Movie_Ratings = Movie_Ratings.fillna(0)
#Filter by column totals
Movie_Ratings.loc[len(Movie_Ratings)] = [Movie_Ratings[col].sum() for col in Movie_Ratings.columns]
##Following portion is taking the maximum amount of time
x = Movie_Ratings.loc[len(Movie_Ratings)-1]
for col in Movie_Ratings.columns:
    if(x[col] <= 500):
        Movie_Ratings.drop(col,axis = 1, inplace = True)
Answer 0 (score: 0)
First, you can simply use DataFrame.sum:
Movie_Ratings.loc[len(Movie_Ratings)] = Movie_Ratings.sum()
Then filter without a loop:
import numpy as np
import pandas as pd

np.random.seed(100)
Movie_Ratings = pd.DataFrame(np.random.randint(250, size=(5,5)), columns=list('ABCDE'))
print (Movie_Ratings)
     A    B    C    D    E
0    8   24   67  103   87
1   79  176  138   94  180
2   98   53   66  226   14
3   34  241  240   24  143
4  228  107   60   58  144
Movie_Ratings.loc[len(Movie_Ratings)] = Movie_Ratings.sum()
Movie_Ratings = Movie_Ratings.loc[:, ~(Movie_Ratings.iloc[-1] <= 500)]
# Or change the condition to > and remove the ~, which inverts the mask
#Movie_Ratings = Movie_Ratings.loc[:, (Movie_Ratings.iloc[-1] > 500)]
print (Movie_Ratings)
     B    C    D    E
0   24   67  103   87
1  176  138   94  180
2   53   66  226   14
3  241  240   24  143
4  107   60   58  144
5  601  571  505  568
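Note that the result above still contains the appended totals row (index 5). If that row is not wanted in the data, a minimal sketch of the same filter that never appends it is shown here. It rebuilds the sample frame from the values printed above; the names `ratings`, `col_totals` and `filtered` are only illustrative and do not come from the original code.

```python
import pandas as pd

# Same sample values as the frame printed above, written out explicitly
ratings = pd.DataFrame({
    'A': [8, 79, 98, 34, 228],
    'B': [24, 176, 53, 241, 107],
    'C': [67, 138, 66, 240, 60],
    'D': [103, 94, 226, 24, 58],
    'E': [87, 180, 14, 143, 144],
})

# Column totals as a Series indexed by column label
col_totals = ratings.sum()

# Keep only the columns whose total exceeds 500; no totals row is added
filtered = ratings.loc[:, col_totals > 500]
print(filtered)
```

Because the mask is aligned on column labels, this selects the same columns as the `iloc[-1]` version while leaving the original five rows untouched.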
Explanation:
print (Movie_Ratings.iloc[-1])
A    447
B    601
C    571
D    505
E    568
Name: 5, dtype: int64
print (Movie_Ratings.iloc[-1]<= 500)
A     True
B    False
C    False
D    False
E    False
Name: 5, dtype: bool
print (~(Movie_Ratings.iloc[-1]<= 500))
A    False
B     True
C     True
D     True
E     True
Name: 5, dtype: bool
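Since the question mentions that the full pivoted table is slow to work with, one further option that may help is to apply the threshold to the long-format data before pivoting, so that the wide table is only built for the movies that survive. This is a rough sketch under assumptions: the tiny `data` frame and the `threshold` of 10 are made up for illustration (the question uses 500), and `values='rating'` is passed as a plain string, which gives flat column labels rather than the nested ones produced by `values=['rating']`.

```python
import pandas as pd

# Made-up sample in the same long format as the question's ratings file
data = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3, 3],
    'movie_id': [10, 20, 10, 20, 10, 30],
    'rating':   [5.0, 1.0, 4.0, 2.0, 3.0, 5.0],
})
threshold = 10  # illustrative; the question filters on 500

# Total rating per movie, broadcast back onto every row of that movie
movie_totals = data.groupby('movie_id')['rating'].transform('sum')

# Drop the rows of low-total movies before building the wide table
kept = data[movie_totals > threshold]

# Pivot only the remaining movies; missing user/movie pairs become 0
movie_ratings = kept.pivot_table(index='user_id', columns='movie_id',
                                 values='rating', fill_value=0)
print(movie_ratings)
```

Whether this is actually faster on the real data set has not been measured here; it simply avoids creating and then dropping columns in the wide frame.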