So I have this dataset which contains a count of how many people have given specific ratings for a range of products, i.e. there is one column for each rating (1-5) and each row contains the count.
ID ratings_count_5 ratings_count_4 ratings_count_3 ratings_count_2 ratings_count_1
2 599 624 78 357 4
3 350 407 95 382 255
4 454 368 52 245 512
5 729 938 520 145 478
6 548 176 431 313 459
7 628 1 1 1 2
Does anyone know how I could find the median rating?
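Answer 0 (score: 2)
This depends entirely on what the values in the ID column mean and how you define the median. I made the following assumptions:
ID is the product ID.
..._count_i is the number of ratings with value i.
Then you get the "median" (strictly speaking, the count-weighted mean rating) by: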
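# Total number of ratings per product (exclude the ID column from the sum)
df$sum = rowSums(df[, -1])
# Note: this is the count-weighted mean rating, not a true median
df$median = (df$ratings_count_5 * 5 + df$ratings_count_4 * 4 +
             df$ratings_count_3 * 3 + df$ratings_count_2 * 2 +
             df$ratings_count_1 * 1) / df$sum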
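The formula in this answer is a count-weighted average of the rating values. As a minimal sketch, the same quantity can be computed with base R's weighted.mean(). The column name mean_rating is my own choice for illustration, and the data are the first three rows from the question:

```r
# Example data copied from the question (first three products only)
df <- data.frame(
  ID = c(2, 3, 4),
  ratings_count_5 = c(599, 350, 454),
  ratings_count_4 = c(624, 407, 368),
  ratings_count_3 = c(78, 95, 52),
  ratings_count_2 = c(357, 382, 245),
  ratings_count_1 = c(4, 255, 512))

# The count columns are ordered 5, 4, 3, 2, 1, so the rating values are 5:1
counts <- as.matrix(df[, -1])

# Weighted mean of the rating values, using each row's counts as weights
df$mean_rating <- apply(counts, 1, function(w) weighted.mean(5:1, w))
df
```

This relies on the count columns appearing in the order 5 to 1, as they do in the question. With a different column order, the 5:1 vector would need to change accordingly.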
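Answer 1 (score: 1)
How about converting the data frame to a matrix (if it is not one already) and then computing the weighted median using the times argument of rep()?
Let's call the original data df. I think this will give you the desired output: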
mat <- as.matrix(df[, -1])
median_rating <- apply(mat, 1, function(x) median(rep(5:1, times=x)))
cbind(df, median_rating)
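Expanding every rating with rep() builds a vector as long as the total number of ratings, which can get large. As a rough sketch, the median can also be read directly off the cumulative counts. The helper median_from_counts below is hypothetical (it is not part of the original answer) and returns the lower median, so it can differ from median() when the total count is even and the two middle ratings differ:

```r
# Example data copied from the question
df <- data.frame(
  ID = c(2, 3, 4, 5, 6, 7),
  ratings_count_5 = c(599, 350, 454, 729, 548, 628),
  ratings_count_4 = c(624, 407, 368, 938, 176, 1),
  ratings_count_3 = c(78, 95, 52, 520, 431, 1),
  ratings_count_2 = c(357, 382, 245, 145, 313, 1),
  ratings_count_1 = c(4, 255, 512, 478, 459, 2))

# Hypothetical helper: lower median of `values`, each repeated `counts` times,
# computed without materialising the repeated vector
median_from_counts <- function(counts, values) {
  ord <- order(values)                 # sort the rating values ascending
  cum <- cumsum(counts[ord])           # cumulative number of ratings
  # first rating value at which at least half of all ratings are covered
  values[ord][which(cum >= sum(counts) / 2)[1]]
}

# The count columns are ordered 5, 4, 3, 2, 1, so the rating values are 5:1
counts <- as.matrix(df[, -1])
df$median_rating <- apply(counts, 1, median_from_counts, values = 5:1)
df
```

On the question's data this agrees with the rep() approach for every row.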
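Answer 2 (score: 0)
You can do the following to get, for each row, the index of the column whose count is closest to the overall median of the counts: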
ID <- c(2,3)
ratings_count_5 <- c(599,350)
ratings_count_4 <- c(624,407)
ratings_count_3 <- c(78,95)
ratings_count_2 <- c(357,382)
ratings_count_1 <- c(4,255)
df <- data.frame(ID,ratings_count_5,ratings_count_4,ratings_count_3,ratings_count_2,ratings_count_1)
# Median of all counts in the table (not a per-product median rating)
df$median <- median(unname(unlist(df[, -1])))
# Absolute distance of each count from that overall median
r <- abs(df[, 2:6] - df$median)
# For each row, position (among the five count columns) of the closest count
df$col_index <- unname(apply(r, 1, which.min))
df
Answer 3 (score: 0)
Solution
Relies heavily on dplyr (with gather() from tidyr):
library(dplyr)
library(tidyr)
df %>%
gather(rating, freq, -ID) %>%
arrange(rating) %>%
group_by(ID) %>%
mutate(cum_dist = cumsum(freq) / sum(freq),
past_half = cum_dist >= 0.5) %>%
filter(past_half) %>%
top_n(-1, cum_dist) %>%
select(ID, rating) %>%
arrange(ID)
Result
ID rating
<dbl> <chr>
1 2 ratings_count_4
2 3 ratings_count_4
3 4 ratings_count_4
4 5 ratings_count_4
5 6 ratings_count_3
6 7 ratings_count_5
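Note
I used the following code to generate df. In the future, I'd suggest including something like this so that users can easily reproduce your data.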
df <- data.frame(
ID = c(2, 3, 4, 5, 6, 7),
ratings_count_5 = c(599, 350, 454, 729, 548, 628),
ratings_count_4 = c(624, 407, 368, 938, 176, 1),
ratings_count_3 = c(78, 95, 52, 520, 431, 1),
ratings_count_2 = c(357, 382, 245, 145, 313, 1),
ratings_count_1 = c(4, 255, 512, 478, 459, 2))