I have a DataFrame of user preferences:
+-------+-----+-----+-----+
|user_id|Movie|Music|Books|
+-------+-----+-----+-----+
|    100|    0|    1|    2|
|    101|    3|    1|    4|
+-------+-----+-----+-----+
How can I 1) compute the sum of each row (user), and 2) divide each value by that sum, so that I get normalized preference values like this:
+-------+------+------+------+
|user_id| Movie| Music| Books|
+-------+------+------+------+
|    100|     0|0.33..|0.66..|
|    101| 0.375| 0.125|   0.5|
+-------+------+------+------+
Answer 0 (score: 2)
# get column names that need to be normalized
cols = [col for col in df.columns if col != 'user_id']
# sum the columns by row
rowsum = sum([df[x] for x in cols])
# select user_id and normalize other columns by rowsum
df.select('user_id', *((df[x] / rowsum).alias(x) for x in cols)).show()
+-------+-----+------------------+------------------+
|user_id|Movie| Music| Books|
+-------+-----+------------------+------------------+
| 100| 0.0|0.3333333333333333|0.6666666666666666|
| 101|0.375| 0.125| 0.5|
+-------+-----+------------------+------------------+
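As a sanity check on the figures above, the same row-wise normalization can be sketched in plain Python without Spark. This is only an illustration: the helper name `normalize_row` is hypothetical, and returning `None` for a row whose values sum to zero is an assumption made here, not something the answer specifies.

```python
# Plain-Python check of the row normalization (no Spark required).
rows = [
    {"user_id": 100, "Movie": 0, "Music": 1, "Books": 2},
    {"user_id": 101, "Movie": 3, "Music": 1, "Books": 4},
]


def normalize_row(row, key="user_id"):
    """Divide every non-key value by the sum of the row's non-key values."""
    cols = [c for c in row if c != key]
    total = sum(row[c] for c in cols)
    out = {key: row[key]}
    for c in cols:
        # Assumption: a row that sums to zero has no meaningful
        # proportions, so None is returned instead of raising an error.
        out[c] = row[c] / total if total != 0 else None
    return out


normalized = [normalize_row(r) for r in rows]
for r in normalized:
    print(r)
```

For user 101 the row sum is 3 + 1 + 4 = 8, which gives 0.375, 0.125 and 0.5, matching the output of the `select` call in the answer.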