How do I divide each value of a DataFrame row by the row's sum (data normalization) in PySpark?

Asked: 2020-07-31 02:47:40

Tags: pyspark

I have a DataFrame of user preferences:

+-------+-----+-----+-----+
|user_id|Movie|Music|Books|
+-------+-----+-----+-----+
|   100 |  0  |  1  |  2  |
|   101 |  3  |  1  |  4  |
+-------+-----+-----+-----+

How can I 1) compute the sum of each row (per user) and 2) divide each value by that sum? The result should be the normalized preference values (e.g., for user 101 the row sum is 3 + 1 + 4 = 8, so Movie becomes 3/8 = 0.375):

+-------+------+-------+-------+
|user_id| Movie| Music | Books |
+-------+------+-------+-------+
|   100 |  0   | 0.33..| 0.66..|
|   101 |0.37..| 0.12..| 0.5   |
+-------+------+-------+-------+

1 Answer:

Answer 0 (score: 2)

# get column names that need to be normalized
cols = [col for col in df.columns if col != 'user_id']

# sum the columns row-wise; the built-in sum() works here because
# pyspark's Column overloads +, yielding a single Movie + Music + Books expression
rowsum = sum([df[x] for x in cols])

# select user_id and normalize other columns by rowsum
df.select('user_id', *((df[x] / rowsum).alias(x) for x in cols)).show()

+-------+-----+------------------+------------------+
|user_id|Movie|             Music|             Books|
+-------+-----+------------------+------------------+
|    100|  0.0|0.3333333333333333|0.6666666666666666|
|    101|0.375|             0.125|               0.5|
+-------+-----+------------------+------------------+
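
For a self-contained run, here is a minimal sketch of the same approach; the local SparkSession, the inline sample data, and the functools.reduce variant are illustrative additions, not part of the original answer. reduce(add, ...) builds the same Movie + Music + Books column expression that the built-in sum() produces above.

from functools import reduce
from operator import add

from pyspark.sql import SparkSession

# illustrative local session; any existing SparkSession works
spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(100, 0, 1, 2), (101, 3, 1, 4)],
    ["user_id", "Movie", "Music", "Books"],
)

cols = [c for c in df.columns if c != "user_id"]

# same row-wise sum as above, written with reduce instead of sum()
rowsum = reduce(add, [df[c] for c in cols])

df.select("user_id", *((df[c] / rowsum).alias(c) for c in cols)).show()

One caveat: if a user's row sum is 0, Spark's / returns null for that row under the default (non-ANSI) settings, so you may want to guard against all-zero rows.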