I am trying to compute baseline predictions for a set of ratings in PySpark.
As part of this, I have created the DataFrame shown below.
+---------+-----+-----+-----+
|user_code|56691|56693|60762|
+---------+-----+-----+-----+
| 975072| 1| 1| null|
| 975079| 1| 1| null|
| 975076| 2| 1| 2|
+---------+-----+-----+-----+
Expected output (the bias computed for each cell of the matrix above):
+---------+-----+-----+-----+
|user/item|56691|56693|60762|
+---------+-----+-----+-----+
| 975072|1.083|1.086| null|
| 975079|1.083|1.086| null|
| 975076| 3.15| 3.19| 1.0|
+---------+-----+-----+-----+
Formula:
reg_param1 = 25
reg_param2 = 10
avg = 2.0
rows = users who have rated items
columns = items that have been rated
BI = sum((rating given by each user who rated item i) - avg) / (reg_param1 + (number of users who have rated item i))
BU = sum((rating of each item rated by user u) - avg) / (reg_param2 + (number of items rated by user u))
For example, user 975072 has rated item 56691 as 1.
For user 975072 and item 56691, the rating is 1, and the number of users who have rated item 56691 is 3 (users 975072, 975079 and 975076).
My BI = ((1-2.0)+(1-2.0)+(2-2.0))/(25+3)
and, with BI subtracted from each rating,
BU = sum((rating of each item rated by user u) - avg - BI) / (reg_param2 + (number of items rated by user u))
Here, the term inside the BU sum works as follows: user 975072 has rated
item 56691, giving (1-2.0-BI), and has also rated item 56693, giving (1-2.0-BI),
but has not rated item 60762, so that item does not need to be considered. My final calculation is therefore:
BU = ((1-2.0-BI)+(1-2.0-BI))/(10+2)
My final bias = avg + BI + BU
Expected output for user 975072, item 56691: -1.083
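To make the arithmetic above concrete, here is a minimal plain-Python sketch (not PySpark) of the BI, BU and final-bias formulas as written, with avg fixed at 2.0 as stated. The names `item_biases`, `user_biases`, `REG_ITEM` and `REG_USER` are illustrative assumptions, and BI for each term of BU is taken to be the bias of the item being rated. The values it produces follow from the formulas alone and are not guaranteed to reproduce the figures in the expected-output table.

```python
from collections import defaultdict

# Constants taken from the question; the variable names are illustrative.
AVG = 2.0
REG_ITEM = 25  # reg_param1
REG_USER = 10  # reg_param2

# (user, item) -> rating; null cells are simply left out.
ratings = {
    (975072, 56691): 1, (975072, 56693): 1,
    (975079, 56691): 1, (975079, 56693): 1,
    (975076, 56691): 2, (975076, 56693): 1, (975076, 60762): 2,
}


def item_biases(ratings, avg, reg):
    """BI per item: sum(rating - avg) / (reg + number of raters)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for (_, item), rating in ratings.items():
        sums[item] += rating - avg
        counts[item] += 1
    return {item: sums[item] / (reg + counts[item]) for item in sums}


def user_biases(ratings, avg, reg, bi):
    """BU per user: sum(rating - avg - BI of that item) / (reg + items rated)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for (user, item), rating in ratings.items():
        sums[user] += rating - avg - bi[item]
        counts[user] += 1
    return {user: sums[user] / (reg + counts[user]) for user in sums}


bi = item_biases(ratings, AVG, REG_ITEM)
bu = user_biases(ratings, AVG, REG_USER, bi)

# Final bias for every rated cell; unrated cells stay absent (null).
baseline = {(u, i): AVG + bi[i] + bu[u] for (u, i) in ratings}
```

Keeping the ratings as (user, item) pairs means no cell is visited individually: each bias is a grouped sum divided by a regularized count.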
My expected output is:
+---------+-----+-----+-----+
|user/item|56691|56693|60762|
+---------+-----+-----+-----+
| 975072|1.083|1.086| null|
| 975079|1.083|1.086| null|
| 975076| 3.15| 3.19| 1.0|
+---------+-----+-----+-----+
How can I iterate over each cell of the DataFrame?
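One possible way to avoid cell-by-cell iteration is to reshape the wide table into long (user, item, rating) rows first, so that the per-item and per-user sums become ordinary grouped aggregations. The sketch below shows only the reshaping idea, in plain Python rather than PySpark; `to_long` is a hypothetical helper, and the rows are copied from the table at the top. In PySpark the same reshaping is often written with the Spark SQL `stack` generator inside `selectExpr`, followed by `groupBy` aggregations; that part is not shown here.

```python
# Column names and values copied from the DataFrame in the question.
columns = ["user_code", "56691", "56693", "60762"]
rows = [
    (975072, 1, 1, None),
    (975079, 1, 1, None),
    (975076, 2, 1, 2),
]


def to_long(columns, rows):
    """Turn wide rows into (user, item, rating) triples, skipping nulls."""
    items = columns[1:]  # every column after user_code is an item id
    return [
        (row[0], int(item), rating)
        for row in rows
        for item, rating in zip(items, row[1:])
        if rating is not None
    ]


triples = to_long(columns, rows)
```

Each triple is one rated cell, so the null entries disappear on their own and the counts in the denominators of BI and BU are just group sizes.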