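I have a sample, and I want to create an aggregate measure based on similarity scores of people's movie interests. For example, consider the following data.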
person <- c('John', 'John', 'Vikram', 'Kris', 'Kris', 'Lara', 'Mohi', 'Mohi', 'Mohi')
year <- c(2010, 2011, 2010, 2010, 2011, 2010, 2010, 2011, 2012)
sciencefiction <- c(4, 5, 0, 44, 32, 5, 32, 43, 33)
romantic <- c(19, 28, 56, 7, 4, 33, 2, 1, 2)
comedy <- c(22, 34, 22, 34, 44, 54, 54, 32, 44)
timespent <- c(30, 40, 100, 33, 22, 80, 96, 22, 34)
df <- data.frame(person, year, sciencefiction, romantic, comedy, timespent)
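I want a variable called similarity score, which is basically the sum, over all other people j in the same year, of the distance between person[i] and person[j] multiplied by the time spent by j. For example, for John in 2010 this would be
score[John, 2010] = 0.8 * 100 + 0.6 * 33 + 0.98 * 80 + 0.73 * 96 = 248.28
Here 0.8 is the distance between John and Vikram, determined by the cosine of the angle between the two vectors formed from sciencefiction, romantic and comedy (cosine distance, computed as a·b / (|a| |b|), with v[i] = 4i + 19j + 22k and v[j] = 0i + 7j + 34k), and 100 is the time Vikram spent watching movies in 2010. John is compared with everyone else in the same way and the results are summed. Is there a way to do this in R and create a variable named score using the process above? Thanks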
Answer 0 (score: 0)
I'll walk through this solution step by step. Skip to the bottom for the overall result.
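Up front: because 2012 has only one person (Mohi), there is no output for that year. You can easily capture it either by not filtering out self-comparisons (which should then be scored as 0) or by merging the missing person/year rows back in afterwards.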
Update 2: your df$person needs to be character, so either create the data frame with
df <- data.frame(..., stringsAsFactors = FALSE)
or convert the column in place with
df$person <- as.character(df$person)
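A quick way to check and, if needed, fix the column type is sketched below. It uses a cut-down, hypothetical data frame named demo rather than the full df, and it forces factors on purpose to mimic older R versions (since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so the conversion mostly matters before that).

```r
# Hypothetical cut-down data; factors are forced here to mimic R < 4.0.0
demo <- data.frame(person = c("John", "Vikram", "Kris"),
                   year = c(2010, 2010, 2010),
                   stringsAsFactors = TRUE)
class(demo$person)   # "factor"

# The pipeline compares names alphabetically with pmin()/pmax(),
# which is meant for character vectors, so convert in place
demo$person <- as.character(demo$person)
class(demo$person)   # "character"

# Element-wise alphabetical minimum of each name and its mirror position
pmin(demo$person, rev(demo$person))
```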
I'm using dplyr here mostly because I think it clearly conveys what is going on. Nothing in the code can't be replaced with base functions (or even data.table).
library(dplyr)
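You could use tidyr::crossing in place of expand.grid and purrr::pmap in place of mapply. They have advantages, but they are mostly drop-in replacements, so I leave that to the reader.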
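A simple geometric angle-calculation function, for reference (strictly speaking it returns the cosine of the angle between the two vectors, not the angle itself):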
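angle <- function(a, b, zero = NaN) {
  # dot product of the two vectors (%*% returns a 1x1 matrix)
  num <- (a %*% b)
  # product of the two Euclidean norms
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  # return `zero` if either vector is all zeros, otherwise the cosine
  if (denom == 0) zero else (num / denom)
}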
Update: if one of the vectors is all 0, R evaluates 0/0 as NaN. Depending on your use case, change this to 0 or NA through the zero argument.
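As a rough check of the function and of the zero argument, the sketch below applies it to John's and Vikram's 2010 rows from the question's data. The function body repeats the answer's computation but uses sum(a * b) for the dot product so that a plain number is returned; the variable names john and vikram are only for illustration.

```r
# Same computation as angle() in the answer, repeated so this runs on its own
angle <- function(a, b, zero = NaN) {
  num <- sum(a * b)                          # dot product
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))   # product of the norms
  if (denom == 0) zero else num / denom
}

john   <- c(4, 19, 22)   # John, 2010: sciencefiction, romantic, comedy
vikram <- c(0, 56, 22)   # Vikram, 2010

angle(john, vikram)                  # about 0.8768
angle(john, c(0, 0, 0))              # NaN, the default for an all-zero vector
angle(john, c(0, 0, 0), zero = NA)   # NA
angle(john, c(0, 0, 0), zero = 0)    # 0
```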
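df %>%
  distinct(year, person) %>%
  group_by(year) %>%
  # all ordered pairs of people within each year
  do( expand.grid(person = .$person, person2 = .$person, stringsAsFactors = FALSE) ) %>%
  ungroup() %>%
  filter(person != person2) %>%
  # order each pair alphabetically so both orderings look identical
  mutate(
    p1 = pmin(person, person2),
    p2 = pmax(person, person2)
  ) %>%
  # drop the original columns first, otherwise distinct() removes nothing
  select(-person, -person2) %>%
  distinct()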
# # A tibble: 13 × 3
# year p1 p2
# <dbl> <chr> <chr>
# 1 2010 John Vikram
# 2 2010 John Kris
# 3 2010 John Lara
# 4 2010 John Mohi
# 5 2010 Kris Vikram
# 6 2010 Lara Vikram
# 7 2010 Mohi Vikram
# 8 2010 Kris Lara
# 9 2010 Kris Mohi
# 10 2010 Lara Mohi
# 11 2011 John Kris
# 12 2011 John Mohi
# 13 2011 Kris Mohi
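If you stop at expand.grid, you end up with redundant pairs such as "John, Vikram" and "Vikram, John". Because I infer that you are interested in pairwise combinations rather than permutations, the rest of that code block removes the redundant rows.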
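To make the combinations-versus-permutations distinction concrete, here is a small base-R sketch on three names only. It uses combn() as a stand-in for the pmin/pmax deduplication in the answer's pipeline; it is not the code used above, just an illustration of the row counts involved.

```r
people <- c("John", "Kris", "Lara")

# Permutations: ordered pairs with self-pairs removed, 3 * 2 = 6 rows
perms <- expand.grid(person = people, person2 = people, stringsAsFactors = FALSE)
perms <- perms[perms$person != perms$person2, ]
nrow(perms)   # 6

# Combinations: unordered pairs, choose(3, 2) = 3 rows
combs <- t(combn(people, 2))
nrow(combs)   # 3
```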
(Continuing the previous data pipeline.)
... %>%
left_join(setNames(df, paste0(colnames(df), "1")), by = c("p1" = "person1", "year" = "year1")) %>%
left_join(setNames(df, paste0(colnames(df), "2")), by = c("p2" = "person2", "year" = "year2"))
# # A tibble: 13 × 11
# year p1 p2 sciencefiction1 romantic1 comedy1 timespent1 sciencefiction2 romantic2 comedy2 timespent2
# <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2010 John Vikram 4 19 22 30 0 56 22 100
# 2 2010 John Kris 4 19 22 30 44 7 34 33
# 3 2010 John Lara 4 19 22 30 5 33 54 80
# 4 2010 John Mohi 4 19 22 30 32 2 54 96
# 5 2010 Kris Vikram 44 7 34 33 0 56 22 100
# 6 2010 Lara Vikram 5 33 54 80 0 56 22 100
# 7 2010 Mohi Vikram 32 2 54 96 0 56 22 100
# 8 2010 Kris Lara 44 7 34 33 5 33 54 80
# 9 2010 Kris Mohi 44 7 34 33 32 2 54 96
# 10 2010 Lara Mohi 5 33 54 80 32 2 54 96
# 11 2011 John Kris 5 28 34 40 32 4 44 22
# 12 2011 John Mohi 5 28 34 40 43 1 32 22
# 13 2011 Kris Mohi 32 4 44 22 43 1 32 22
... %>%
  mutate(
    angle = mapply(function(a,b,c, d,e,f) angle(c(a,b,c), c(d,e,f), zero=NA),
                   sciencefiction1, romantic1, comedy1,
                   sciencefiction2, romantic2, comedy2, SIMPLIFY = TRUE)
  ) %>%
  select(year, p1, p2, starts_with("timespent"), angle)
# A tibble: 13 × 6
# year p1 p2 timespent1 timespent2 angle
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
# 1 2010 John Vikram 30 100 0.8768294
# 2 2010 John Kris 30 33 0.6427461
# 3 2010 John Lara 30 80 0.9851037
# 4 2010 John Mohi 30 96 0.7347653
# 5 2010 Kris Vikram 33 100 0.3380778
# 6 2010 Lara Vikram 80 100 0.7948679
# 7 2010 Mohi Vikram 96 100 0.3440492
# 8 2010 Kris Lara 33 80 0.6428056
# 9 2010 Kris Mohi 33 96 0.9256539
# 10 2010 Lara Mohi 80 96 0.7881070
# 11 2011 John Kris 40 22 0.7311130
# 12 2011 John Mohi 40 22 0.5600843
# 13 2011 Kris Mohi 22 22 0.9533073
... %>%
  group_by(year, person = p1) %>%
  summarize(
    # same value as angle %*% timespent2, but a plain number rather than a 1x1 matrix
    score = sum(angle * timespent2)
  ) %>%
  ungroup()
# # A tibble: 6 × 3
# year person score
# <dbl> <chr> <dbl>
# 1 2010 John 258.23933
# 2 2010 Kris 174.09501
# 3 2010 Lara 155.14507
# 4 2010 Mohi 34.40492
# 5 2011 John 28.40634
# 6 2011 Kris 20.97276
I'm guessing that the difference between my 258.24 and your 248.28 is due to the second vector (Vikram's values).
All at once:
df %>%
  distinct(year, person) %>%
  group_by(year) %>%
  do( expand.grid(person = .$person, person2 = .$person, stringsAsFactors = FALSE) ) %>%
  ungroup() %>%
  filter(person != person2) %>%
  mutate(
    p1 = pmin(person, person2),
    p2 = pmax(person, person2)
  ) %>%
  select(-person, -person2) %>%
  distinct() %>%
  # pairwise lookups
  left_join(setNames(df, paste0(colnames(df), "1")), by = c("p1" = "person1", "year" = "year1")) %>%
  left_join(setNames(df, paste0(colnames(df), "2")), by = c("p2" = "person2", "year" = "year2")) %>%
  # calc angles
  mutate(
    angle = mapply(function(a,b,c, d,e,f) angle(c(a,b,c), c(d,e,f)),
                   sciencefiction1, romantic1, comedy1,
                   sciencefiction2, romantic2, comedy2, SIMPLIFY = TRUE)
  ) %>%
  # calc scores
  group_by(year, person = p1) %>%
  summarize(
    # same value as angle %*% timespent2, but a plain number rather than a 1x1 matrix
    score = sum(angle * timespent2)
  ) %>%
  ungroup()
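One caveat: once the pairs are reduced to unordered combinations, each pair only adds to the score of its alphabetically first member (p1), which is why Vikram in 2010 and Mohi in 2011 get no row. If the intent is the question's definition, where each person is compared against every other person in the same year, a base-R sketch that keeps both orderings might look like the following. It swaps the dplyr pipeline for merge() and aggregate(); the names features, pairs and weighted are my own and not part of the original code.

```r
# Cosine of the angle between two vectors (same computation as angle() above)
angle <- function(a, b, zero = NaN) {
  num <- sum(a * b)
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) zero else num / denom
}

df <- data.frame(
  person = c("John", "John", "Vikram", "Kris", "Kris", "Lara", "Mohi", "Mohi", "Mohi"),
  year = c(2010, 2011, 2010, 2010, 2011, 2010, 2010, 2011, 2012),
  sciencefiction = c(4, 5, 0, 44, 32, 5, 32, 43, 33),
  romantic = c(19, 28, 56, 7, 4, 33, 2, 1, 2),
  comedy = c(22, 34, 22, 34, 44, 54, 54, 32, 44),
  timespent = c(30, 40, 100, 33, 22, 80, 96, 22, 34),
  stringsAsFactors = FALSE
)
features <- c("sciencefiction", "romantic", "comedy")

# Self-join within each year gives every ordered (i, j) pair; drop self-pairs
pairs <- merge(df, df, by = "year", suffixes = c("1", "2"))
pairs <- pairs[pairs$person1 != pairs$person2, ]

# Cosine between person i's and person j's genre vectors, row by row
pairs$angle <- vapply(seq_len(nrow(pairs)), function(i) {
  angle(unlist(pairs[i, paste0(features, "1")]),
        unlist(pairs[i, paste0(features, "2")]), zero = NA_real_)
}, numeric(1))

# Weight each cosine by the time spent by the other person j, then sum per i
pairs$weighted <- pairs$angle * pairs$timespent2
scores <- aggregate(weighted ~ year + person1, data = pairs, FUN = sum)
names(scores) <- c("year", "person", "score")
scores <- scores[order(scores$year, scores$person), ]
scores
```

John's 2010 score is unchanged by this, because John is already p1 in all of his pairs; the other people pick up the pairs they were previously missing. Mohi in 2012 still has no row, since there is nobody to compare against.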