Finding baseline predictors for a set of ratings in PySpark?

Date: 2018-03-21 12:56:13

Tags: python apache-spark dataframe pyspark pyspark-sql

I am trying to compute baseline predictors for a set of ratings in PySpark.

As part of this, I created the DataFrame shown below.

+---------+-----+-----+-----+
|user_code|56691|56693|60762|
+---------+-----+-----+-----+
|   975072|    1|    1| null|
|   975079|    1|    1| null|
|   975076|    2|    1|    2|
+---------+-----+-----+-----+

Expected output [the bias computed for each cell of the matrix above]:

+---------+-----+-----+-----+
|user/item|56691|56693|60762|
+---------+-----+-----+-----+
|   975072|1.083|1.086| null|
|   975079|1.083|1.086| null|
|   975076| 3.15| 3.19|  1.0|
+---------+-----+-----+-----+

Formulas:

reg_param1 = 25
reg_param2 = 10
avg = 2.0
rows = users who have rated items
columns = items that have been rated
BI = sum(rating of item i by each user who rated it - avg) / (reg_param1 + number of users who rated item i)
BU = sum(rating of each item rated by user u - avg - BI of that item) / (reg_param2 + number of items rated by user u)
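To make the formulas concrete, here is a minimal plain-Python sketch that applies them to the sample matrix. The function name `baseline` and the dictionary layout are illustrative assumptions, not part of the question. The sketch follows the formulas literally (item bias first, then user bias computed on the residuals), so its numbers need not match the expected table above.

```python
from collections import defaultdict

# Ratings from the sample matrix; None marks a missing rating.
ratings = {
    975072: {"56691": 1, "56693": 1, "60762": None},
    975079: {"56691": 1, "56693": 1, "60762": None},
    975076: {"56691": 2, "56693": 1, "60762": 2},
}


def baseline(ratings, avg=2.0, reg_param1=25, reg_param2=10):
    """Return (BI per item, BU per user, avg + BI + BU per rated cell)."""
    # Collect the observed ratings of every item, skipping missing cells.
    by_item = defaultdict(list)
    for row in ratings.values():
        for item, r in row.items():
            if r is not None:
                by_item[item].append(r)

    # BI: regularized mean deviation of an item's ratings from avg.
    bi = {
        item: sum(r - avg for r in rs) / (reg_param1 + len(rs))
        for item, rs in by_item.items()
    }

    # BU: regularized mean residual of a user's ratings after removing avg and BI.
    bu = {}
    for user, row in ratings.items():
        rated = {i: r for i, r in row.items() if r is not None}
        bu[user] = sum(r - avg - bi[i] for i, r in rated.items()) / (
            reg_param2 + len(rated)
        )

    # Baseline value only for cells that hold a rating; others stay None.
    pred = {
        user: {
            item: (avg + bi[item] + bu[user]) if r is not None else None
            for item, r in row.items()
        }
        for user, row in ratings.items()
    }
    return bi, bu, pred


bi, bu, pred = baseline(ratings)
for user, row in pred.items():
    print(user, row)
```

Keeping the item pass separate from the user pass mirrors the order in which the formulas depend on each other: BU cannot be computed until every BI is known.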


 For example, user 975072 rated item 56691 as 1.
    For user 975072 and item 56691, the ratings of item 56691 are 1, 1 and 2, and the number of users who rated it is 3 (975072, 975079, 975076), so
    BI = ((1-2.0) + (1-2.0) + (2-2.0)) / (25+3)
    and
    BU = sum(rating of each item rated by user u - avg - BI of that item) / (reg_param2 + number of items rated by user u)

Here:

 User 975072 rated item 56691, which contributes (1 - 2.0 - BI_56691), and item 56693, which contributes (1 - 2.0 - BI_56693).
    The user did not rate item 60762, so it is not considered. The final calculation is:
    BU = ((1 - 2.0 - BI_56691) + (1 - 2.0 - BI_56693)) / (10+2)

    Final bias = avg + BI + BU
    Expected output for user 975072, item 56691: 1.083

How can I iterate over each cell of the DataFrame?
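One possible approach, sketched here under assumptions, avoids visiting cells one by one: unpivot the wide matrix into long (user, item, rating) rows with Spark SQL's `stack`, compute BI and BU with `groupBy` aggregations, then pivot back. The helper names `stack_expr` and `baseline_wide` are hypothetical, and the sketch assumes every item column has the same numeric type and that `user_code` is the id column.

```python
def stack_expr(item_cols):
    """Build a Spark SQL stack() expression that unpivots the item columns."""
    # Each pair is (item label as a string literal, backtick-quoted column).
    pairs = ", ".join(f"'{c}', `{c}`" for c in item_cols)
    return f"stack({len(item_cols)}, {pairs}) as (item, rating)"


def baseline_wide(df, id_col="user_code", avg=2.0, reg1=25, reg2=10):
    """Return a wide DataFrame of avg + BI + BU for every rated cell."""
    # Imported lazily so stack_expr stays usable without a Spark install.
    from pyspark.sql import functions as F

    item_cols = [c for c in df.columns if c != id_col]

    # Wide -> long, dropping cells that have no rating.
    long_df = df.selectExpr(id_col, stack_expr(item_cols)).where(
        "rating is not null"
    )

    # BI per item: sum(rating - avg) / (reg1 + number of raters).
    item_bias = long_df.groupBy("item").agg(
        (F.sum(F.col("rating") - avg) / (reg1 + F.count("rating"))).alias("bi")
    )
    with_bi = long_df.join(item_bias, "item")

    # BU per user: sum(rating - avg - BI) / (reg2 + number of rated items).
    user_bias = with_bi.groupBy(id_col).agg(
        (
            F.sum(F.col("rating") - avg - F.col("bi"))
            / (reg2 + F.count("rating"))
        ).alias("bu")
    )

    scored = with_bi.join(user_bias, id_col).withColumn(
        "baseline", F.lit(avg) + F.col("bi") + F.col("bu")
    )

    # Long -> wide again; unrated cells come back as null.
    return scored.groupBy(id_col).pivot("item").agg(F.first("baseline"))
```

With the DataFrame from the question bound to `df`, `baseline_wide(df).show()` would display the result. Working on the long form lets Spark distribute the aggregations instead of looping over cells on the driver.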

0 answers:

No answers