I have the following data frame in R:
my_df <- data.frame(V1 = c(1,2,3,1), V2 = c("A","B","C","A"), V3 = c("S1", "S1", "S1", "S2"), V4 = c("x","x","x","x"), V5 = c("y","y","y","y"), V6 =c("A", "B", "C", "D"))
> my_df
V1 V2 V3 V4 V5 V6
1 1 A S1 x y A
2 2 B S1 x y B
3 3 C S1 x y C
4 1 A S2 x y D
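Now I want to check whether any combination of values in V1 and V2 occurs more than once in the data frame. In my example, rows 1 and 4 of my_df share the same values, "1 A". When that happens, I need the following output: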
> my_df_new
V1 V2 V3 V4 V5 V6_S1 V6_S2
1 1 A S1;S2 x y A D
2 2 B S1 x y B
3 3 C S1 x y C
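So basically two things change, as the output above shows: the V3 values of the duplicated rows are collapsed into a single string (S1;S2), and V6 is split into one column per V3 value (V6_S1, V6_S2).
The remaining columns and values should stay the same.
How can I achieve this?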
Answer 0 (score: 2)
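Here is one way using dplyr: group_by V1 and V2, collapse V3, create a new column (V7) that holds the target column names, and spread the duplicated values into separate columns.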
library(dplyr)
my_df %>%
group_by(V1, V2) %>%
mutate(V3 = toString(V3),
V7 = paste0("V6_S", row_number())) %>%
tidyr::spread(V7, V6)
# V1 V2 V3 V4 V5 V6_S1 V6_S2
# <dbl> <fct> <chr> <fct> <fct> <fct> <fct>
#1 1 A S1, S2 x y A D
#2 2 B S1 x y B NA
#3 3 C S1 x y C NA
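Note that toString() joins with ", ", while the question asks for "S1;S2" and for columns named after the V3 values. A minimal sketch of a variant that does this with tidyr::pivot_wider (the successor to spread) follows. It assumes each V3 value appears at most once within a V1/V2 group; the name my_df_new is taken from the question, and stringsAsFactors = FALSE is added here only so the columns are plain character vectors.

```r
library(dplyr)
library(tidyr)

my_df <- data.frame(V1 = c(1, 2, 3, 1),
                    V2 = c("A", "B", "C", "A"),
                    V3 = c("S1", "S1", "S1", "S2"),
                    V4 = c("x", "x", "x", "x"),
                    V5 = c("y", "y", "y", "y"),
                    V6 = c("A", "B", "C", "D"),
                    stringsAsFactors = FALSE)

my_df_new <- my_df %>%
  group_by(V1, V2) %>%
  # V7 is built from the original V3 before V3 is overwritten
  mutate(V7 = paste0("V6_", V3),
         V3 = paste(V3, collapse = ";")) %>%
  ungroup() %>%
  # one column per V7 value, filled with the matching V6 (NA where absent)
  pivot_wider(names_from = V7, values_from = V6)

print(my_df_new)
```

If a V1/V2 group could contain the same V3 value twice, pivot_wider would return list columns, so the row_number() naming used above is the safer choice in that case.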
Answer 1 (score: 0)
There should be a cleaner way to do this that doesn't force things by hand, but this is what I came up with:
library(data.table)
library(splitstackshape)
cSplit(setDT(my_df)[, .(V3 = toString(V3),
V4 = V4[1],
V5 = V5[1],
V6 = toString(V6)), .(V1, V2)], 'V6')
# V1 V2 V3 V4 V5 V6_1 V6_2
#1: 1 A S1, S2 x y A D
#2: 2 B S1 x y B <NA>
#3: 3 C S1 x y C <NA>
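For comparison, here is a sketch that stays within data.table and uses dcast instead of cSplit, so the new columns are named after the V3 values (V6_S1, V6_S2) and V3 is joined with ";" as in the question. The names dt, res and the helper column V7 are introduced only for this illustration, and the sketch assumes each V3 value occurs at most once per V1/V2 group.

```r
library(data.table)

dt <- data.table(V1 = c(1, 2, 3, 1),
                 V2 = c("A", "B", "C", "A"),
                 V3 = c("S1", "S1", "S1", "S2"),
                 V4 = c("x", "x", "x", "x"),
                 V5 = c("y", "y", "y", "y"),
                 V6 = c("A", "B", "C", "D"))

# Helper column with the target column names, built from the original V3
dt[, V7 := paste0("V6_", V3)]

# Collapse V3 within each V1/V2 group
dt[, V3 := paste(V3, collapse = ";"), by = .(V1, V2)]

# Long to wide: one row per V1/V2 group, one V6_* column per V3 value
res <- dcast(dt, V1 + V2 + V3 + V4 + V5 ~ V7, value.var = "V6")

print(res)
```

The formula still lists V4 and V5 explicitly, so this only works as long as those columns are constant within each V1/V2 group, as they are in the example data.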