I have a DataFrame containing users' movie ratings. Its .describe() method gives the following information:
              userId  ratings_per_user
count  137658.000000     137658.000000
mean    69247.463068         65.745514
std     39977.471244         67.071719
min         1.000000          1.000000
25%     34628.250000         22.000000
50%     69249.500000         41.000000
75%    103868.750000         84.000000
max    138493.000000        462.000000
Now I want to draw a sample of X users from each quartile:
X users with number of votes between min and 25%
X users with number of votes between 25% and 50%
X users with number of votes between 50% and 75%
X users with number of votes between 75% and max
The end result should be a single list of userIds of size 4X.
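What I have so far is clumsy code that splits the DataFrame into 4 separate DataFrames according to the quartiles, samples X users from each of them, and then concatenates the results. I am hoping there is a simpler (and faster) solution.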
Edit:
A better solution:
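import numpy as np
import pandas as pd

# total sample size desired
N = 100
# number of ratings per user
ratings_user = ratings.groupby(['userId']).size().reset_index(name='ratings_per_user')
# label each user by quartile, using the thresholds from .describe() above
# (the last test is a strict >, so users with exactly 22 ratings fall in 'D')
ratings_user['categ'] = np.where(ratings_user['ratings_per_user'] >= 84.0, 'A',
                        np.where(ratings_user['ratings_per_user'] >= 41.0, 'B',
                        np.where(ratings_user['ratings_per_user'] > 22.0, 'C', 'D')))
# sample from each category in proportion to its size, then shuffle the rows
ratings = (ratings_user.groupby('categ', group_keys=False)
           .apply(lambda x: x.sample(int(np.rint(N * len(x) / len(ratings_user)))))
           .sample(frac=1).reset_index(drop=True))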
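
Note that the code above allocates the N samples in proportion to each category's size, so it does not necessarily return exactly X users per quartile. Below is a minimal sketch of the fixed-X variant. It assumes pandas 1.1 or later (for DataFrameGroupBy.sample) and that the quartile edges computed by pd.qcut are distinct. The helper name sample_per_quartile, its seed parameter, and the synthetic demo frame are made up for illustration and are not part of my real code or data.

```python
import numpy as np
import pandas as pd


def sample_per_quartile(ratings_user, x, seed=None):
    """Return x userIds drawn from each quartile of ratings_per_user (4 * x in total)."""
    df = ratings_user.copy()
    # pd.qcut bins the values on their empirical quartiles, so the
    # thresholds do not have to be copied by hand from .describe()
    df["quartile"] = pd.qcut(df["ratings_per_user"], q=4,
                             labels=["Q1", "Q2", "Q3", "Q4"])
    # draw x rows without replacement from every quartile group
    sampled = df.groupby("quartile", observed=True).sample(n=x, random_state=seed)
    return sampled["userId"].tolist()


# Synthetic stand-in for the real data (values invented for this example)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "userId": np.arange(1, 1001),
    "ratings_per_user": rng.integers(1, 463, size=1000),
})

user_ids = sample_per_quartile(demo, x=25, seed=0)
print(len(user_ids))  # 4 * 25 = 100 ids
```

If a quartile holds fewer than X users, sample raises a ValueError unless replace=True is passed, so X has to stay below the size of the smallest group.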