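I have a data frame in R that records customers' ranked preferences across many different brands. A sample of the data frame is shown in the table below. The actual table is much larger in both dimensions (roughly 80,000 x 30).
My table: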
+------+---------+---------+---------+---------+
| User | Brand_A | Brand_B | Brand_C | Brand_D |
+------+---------+---------+---------+---------+
| A | 1 | NA | 3 | 2 |
| B | NA | NA | NA | 1 |
| C | 3 | 2 | 4 | 1 |
| D | NA | 1 | 2 | NA |
+------+---------+---------+---------+---------+
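Here 1 means the customer ranked the brand as "best", and NA means the customer did not rank the brand. I would like to create a table that picks the top 3 (or top N) ranked brands for each user and outputs something like the following: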
+------+---------+---------+---------+
| User | Ranked1 | Ranked2 | Ranked3 |
+------+---------+---------+---------+
| A | Brand_A | Brand_D | Brand_C |
| B | Brand_D | NA | NA |
| C | Brand_D | Brand_B | Brand_A |
| D | Brand_B | Brand_C | NA |
+------+---------+---------+---------+
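Assume each customer's ranking is exhaustive, i.e. if a customer has used only one brand, that brand is automatically ranked 1.
I tried using a for loop to get the desired output, but without success. I think I am missing something fairly simple.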
Answer 0 (score: 1)
One option is to melt your data and then recast it. With data.table, this looks as follows:
library(data.table)
# `data` is the data frame from the question; note that the id column is "User"
long <- melt(as.data.table(data), id.vars = "User")[!is.na(value)]
long[, rank := paste0("Ranked", value)]
dcast(long, User ~ rank, value.var = "variable")
# User Ranked1 Ranked2 Ranked3 Ranked4
#1 A Brand_A Brand_D Brand_C <NA>
#2 B Brand_D <NA> <NA> <NA>
#3 C Brand_D Brand_B Brand_A Brand_C
#4 D Brand_B Brand_C <NA> <NA>
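The output above keeps every rank, not just the top 3, and with more than nine brands the character labels would sort as "Ranked1", "Ranked10", "Ranked2". A minimal sketch of one way around both, assuming the sample data from the question; the names `n_top`, `long` and `wide` are only illustrative. It filters on the numeric rank first, casts on the numeric column so the columns come out in rank order, and adds the "Ranked" prefix afterwards:

```r
library(data.table)

# Sample data copied from the question
df <- data.frame(
  User    = c("A", "B", "C", "D"),
  Brand_A = c(1, NA, 3, NA),
  Brand_B = c(NA, NA, 2, 1),
  Brand_C = c(3, NA, 4, 2),
  Brand_D = c(2, 1, 1, NA)
)

n_top <- 3  # illustrative: how many ranks to keep

# Long format: one row per (User, brand) pair that actually has a rank
long <- melt(as.data.table(df), id.vars = "User",
             variable.name = "brand", value.name = "rank",
             na.rm = TRUE, variable.factor = FALSE)

# Keep only the best n_top ranks, then cast on the numeric rank column
wide <- dcast(long[rank <= n_top], User ~ rank, value.var = "brand")

# Add the "Ranked" prefix once the columns are already in numeric order
setnames(wide, names(wide)[-1], paste0("Ranked", names(wide)[-1]))
wide
```

This relies on each user's ranks being unique, which the question states; with ties, `dcast` would need an aggregation function.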
Answer 1 (score: 1)
You can do this with apply:
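# Apply over the rank columns only (df[-1]) so each row stays numeric;
# including User would coerce the ranks to strings before they are ordered
top3 <- t(apply(df[-1], 1, function(x) names(x)[order(x, na.last = NA)][1:3]))
df2 <- data.frame(User = df$User, top3)
colnames(df2) <- c("User", paste0("Ranked", 1:3))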
This returns:
User Ranked1 Ranked2 Ranked3
1 A Brand_A Brand_D Brand_C
2 B Brand_D <NA> <NA>
3 C Brand_D Brand_B Brand_A
4 D Brand_B Brand_C <NA>
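Since the question asks for the top N rather than only the top 3, the same idea can be wrapped in a small function. This is just a sketch in base R: `top_n_brands()` and its arguments `n` and `id_col` are hypothetical names, not something from the answer above. It uses `vapply` instead of `apply` so the result keeps its shape when `n = 1`.

```r
# Sample data copied from the question
df <- data.frame(
  User    = c("A", "B", "C", "D"),
  Brand_A = c(1, NA, 3, NA),
  Brand_B = c(NA, NA, 2, 1),
  Brand_C = c(3, NA, 4, 2),
  Brand_D = c(2, 1, 1, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the n best-ranked brand names per user
top_n_brands <- function(df, n = 3, id_col = "User") {
  ranks  <- as.matrix(df[setdiff(names(df), id_col)])  # numeric rank matrix
  brands <- colnames(ranks)

  # For each row: order brands by rank, drop unranked ones (na.last = NA),
  # then take exactly n entries; missing positions become NA
  top <- vapply(
    seq_len(nrow(ranks)),
    function(i) brands[order(ranks[i, ], na.last = NA)][seq_len(n)],
    character(n)
  )
  top <- t(matrix(top, nrow = n))  # one row per user, one column per rank

  out <- data.frame(df[id_col], top, stringsAsFactors = FALSE)
  names(out) <- c(id_col, paste0("Ranked", seq_len(n)))
  out
}

top_n_brands(df, n = 3)
```

Calling `top_n_brands(df, n = 2)` would keep only `Ranked1` and `Ranked2`. On a table of about 80,000 rows the row-wise loop should still be workable, though I have not timed it.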
Answer 2 (score: 1)
With the tidyverse:
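df <- read.table(header = TRUE, text = '
User Brand_A Brand_B Brand_C Brand_D
A 1 NA 3 2
B NA NA NA 1
C 3 2 4 1
D NA 1 2 NA
')
library(tidyverse)
df %>%
  # long format: one row per (User, brand) pair, unranked brands dropped
  gather(brand, rank, -User, na.rm = TRUE) %>%
  # keep only the top 3 ranks
  filter(rank < 4) %>%
  # back to wide format: one column per rank (rank1, rank2, rank3)
  spread(rank, brand, sep = '')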
This produces:
User rank1 rank2 rank3
1 A Brand_A Brand_D Brand_C
2 B Brand_D <NA> <NA>
3 C Brand_D Brand_B Brand_A
4 D Brand_B Brand_C <NA>
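In tidyr 1.0 and later, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. A sketch of the same pipeline with the newer functions, assuming the sample data from the question; `n_top` and `res` are illustrative names. Sorting by rank before widening makes the new columns appear in rank order.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question
df <- data.frame(
  User    = c("A", "B", "C", "D"),
  Brand_A = c(1, NA, 3, NA),
  Brand_B = c(NA, NA, 2, 1),
  Brand_C = c(3, NA, 4, 2),
  Brand_D = c(2, 1, 1, NA),
  stringsAsFactors = FALSE
)

n_top <- 3  # illustrative: how many ranks to keep

res <- df %>%
  # long format: one row per (User, brand), dropping unranked brands
  pivot_longer(-User, names_to = "brand", values_to = "rank",
               values_drop_na = TRUE) %>%
  # keep only the best n_top ranks
  filter(rank <= n_top) %>%
  # sort by rank so the widened columns are created in rank order
  arrange(rank) %>%
  # wide format: one column per rank, named Ranked1, Ranked2, ...
  pivot_wider(names_from = rank, values_from = brand,
              names_prefix = "Ranked") %>%
  arrange(User)

res
```

The result is a tibble with columns `User`, `Ranked1`, `Ranked2` and `Ranked3`, with `NA` where a user ranked fewer than `n_top` brands.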