How do I divide each value of a DataFrame row by the row's sum (data normalization) in PySpark?

Asked: 2020-07-31 02:47:40

Tags: pyspark

I have a DataFrame of user preferences:

+-------+-----+-----+-----+
|user_id|Movie|Music|Books|
+-------+-----+-----+-----+
|   100 |  0  |  1  |  2  |
|   101 |  3  |  1  |  4  |
+-------+-----+-----+-----+

How can I 1) compute the sum of each row (per user) and 2) divide each value by that sum? The result should be the normalized preference values (e.g., for user 101 the row sum is 3 + 1 + 4 = 8, so Movie becomes 3/8 = 0.375):

+-------+------+-------+-------+
|user_id| Movie| Music | Books |
+-------+------+-------+-------+
|   100 |  0   | 0.33..| 0.66..|
|   101 |0.37..| 0.12..| 0.5   |
+-------+------+-------+-------+

1 Answer:

Answer 0 (score: 2)

# get column names that need to be normalized
cols = [col for col in df.columns if col != 'user_id']

# sum the columns row-wise; the built-in sum() works here because
# pyspark's Column overloads +, yielding a single Movie + Music + Books expression
rowsum = sum([df[x] for x in cols])

# select user_id and normalize other columns by rowsum
df.select('user_id', *((df[x] / rowsum).alias(x) for x in cols)).show()

+-------+-----+------------------+------------------+
|user_id|Movie|             Music|             Books|
+-------+-----+------------------+------------------+
|    100|  0.0|0.3333333333333333|0.6666666666666666|
|    101|0.375|             0.125|               0.5|
+-------+-----+------------------+------------------+
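
For a self-contained run, here is a minimal sketch of the same approach; the local SparkSession, the inline sample data, and the functools.reduce variant are illustrative additions, not part of the original answer. reduce(add, ...) builds the same Movie + Music + Books column expression that the built-in sum() produces above.

from functools import reduce
from operator import add

from pyspark.sql import SparkSession

# illustrative local session; any existing SparkSession works
spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(100, 0, 1, 2), (101, 3, 1, 4)],
    ["user_id", "Movie", "Music", "Books"],
)

cols = [c for c in df.columns if c != "user_id"]

# same row-wise sum as above, written with reduce instead of sum()
rowsum = reduce(add, [df[c] for c in cols])

df.select("user_id", *((df[c] / rowsum).alias(c) for c in cols)).show()

One caveat: if a user's row sum is 0, Spark's / returns null for that row under the default (non-ANSI) settings, so you may want to guard against all-zero rows.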