I have the following data table:
RowID| Col1 | Col2 |
----------------------
1 | apple | cow |
2 | orange | dog |
3 | apple | cat |
4 | cherry | fish |
5 | cherry | ant |
6 | apple | rat |
I want to get to this table:
RowID| Col1 | Col2 | newCol
------------------------------
1 | apple | cow | cat
2 | apple | cow | rat
3 | orange | dog | na
4 | apple | cat | cow
5 | apple | cat | rat
6 | cherry | fish | ant
7 | cherry | ant | fish
8 | apple | rat | cow
9 | apple | rat | cat
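To help visualize the logic of the table above: it is basically the same as the table below, except that the list column is split into rows, one row per value. Rows are matched on the values in Col1, so, for example, rows 1, 3 and 6 of the original table all have "apple" in the first column. The new "list" column therefore contains the Col2 values of all the other matching rows, and each list element is then expanded into a new row. The second table above is the result I want; this third table is only here to help visualize where the values come from.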
RowID| Col1 | Col2 | newCol
------------------------------
1 | apple | cow | cat,rat (Row 3 & 6 match col1 values)
2 | orange | dog | na (No rows match this col1 value)
3 | apple | cat | cow,rat (Row 1 & 6 match col1 values)
4 | cherry | fish | ant (Row 5 matches col1 values)
5 | cherry | ant | fish (Row 4 matches col1 values)
6 | apple | rat | cow,cat (Row 1 & 3 match col1 values)
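A minimal base R sketch of the matching logic described above, reproducing the `newCol` column of the third (visualization) table. It assumes the input is held in a data frame named `dat`; that name and the helper `matches` are illustrative, not taken from the question.

```r
# Input table from the question (the name `dat` is an assumption).
dat <- data.frame(
  RowID = 1:6,
  Col1 = c("apple", "orange", "apple", "cherry", "cherry", "apple"),
  Col2 = c("cow", "dog", "cat", "fish", "ant", "rat"),
  stringsAsFactors = FALSE
)

# For each row, collect the Col2 values of the *other* rows sharing its Col1.
matches <- lapply(seq_len(nrow(dat)), function(i) {
  same <- dat$Col1 == dat$Col1[i] & dat$RowID != dat$RowID[i]
  dat$Col2[same]
})

# Collapse each list element for display; rows with no match get NA.
dat$newCol <- vapply(matches, function(m) {
  if (length(m) == 0) NA_character_ else paste(m, collapse = ",")
}, character(1))

print(dat)
```

Expanding each element of `matches` into its own row would then give the second (desired) table.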
Answer 0 (score: 2)
Using the data.table package:
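library(data.table)

# dat is the input table from the question, held as a data.frame or data.table

# option 1: cross join Col2 with itself within each Col1 group,
# then drop the self-pairs (kept only when a row has no other match)
setDT(dat)[, .SD[CJ(Col2 = Col2, newCol = Col2, unique = TRUE), on = .(Col2)]
           , by = Col1
           ][order(RowID), .SD[Col2 != newCol | .N == 1], by = RowID]

# option 2: collapse Col2 per Col1 group into one string, split it back
# into one row per value, then drop the self-pairs as in option 1
setDT(dat)[, newCol := paste0(Col2, collapse = ","), by = Col1
           ][, .(newCol = unlist(tstrsplit(newCol, ","))), by = .(RowID, Col1, Col2)
           ][, .SD[Col2 != newCol | .N == 1], by = RowID]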
which gives:
   RowID   Col1 Col2 newCol
1:     1  apple  cow    cat
2:     1  apple  cow    rat
3:     2 orange  dog    dog
4:     3  apple  cat    cow
5:     3  apple  cat    rat
6:     4 cherry fish    ant
7:     5 cherry  ant   fish
8:     6  apple  rat    cow
9:     6  apple  rat    cat
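Note that this output shows `dog` in `newCol` for the unmatched `orange` row, whereas the question asks for `na` there. A possible follow-up step, sketched here under the assumption that the option 2 pipeline is used and its result is stored in a variable named `res` (a hypothetical name), is to blank out the remaining self-pairs:

```r
library(data.table)

# Input table from the question (the name `dat` is an assumption).
dat <- data.table(
  RowID = 1:6,
  Col1 = c("apple", "orange", "apple", "cherry", "cherry", "apple"),
  Col2 = c("cow", "dog", "cat", "fish", "ant", "rat")
)

# Option 2 pipeline, run on a copy so `dat` itself is not modified by `:=`.
res <- copy(dat)[, newCol := paste0(Col2, collapse = ","), by = Col1
  ][, .(newCol = unlist(tstrsplit(newCol, ","))), by = .(RowID, Col1, Col2)
  ][, .SD[Col2 != newCol | .N == 1], by = RowID]

# After the filter, a row can only have Col2 == newCol when it had no other
# match in its Col1 group, so those are the rows to set to NA.
res[Col2 == newCol, newCol := NA_character_]

print(res)
```

This relies on the filter having already removed self-pairs from every group with more than one row.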
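The tidyverse equivalent:
library(dplyr)
library(tidyr)

dat %>%
  group_by(Col1) %>%
  mutate(newCol = paste0(Col2, collapse = ",")) %>%  # all Col2 values per Col1 group
  separate_rows(newCol) %>%                          # one row per value
  group_by(RowID) %>%
  filter(Col2 != newCol | n() == 1)                  # drop self-pairs, keep unmatched rows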
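Answer 1 (score: 0)
Self-join the table on the first column and get rid of the rows where NewCol equals Col2. The tricky part is keeping the rows whose Col1 value occurs only once in the data.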
library(data.table)
library(magrittr)

dt_foo = data.table(Col1 = c("apple", "orange", "apple", "cherry",
                             "cherry", "apple"),
                    Col2 = c("cow", "dog", "cat", "fish",
                             "ant", "rat"))

# Col1 values that occur only once; required to later set NA values
single_occ = dt_foo[, .N, Col1] %>%
  .[N == 1, Col1]

dt_foo2 = dt_foo %>%
  .[., on = "Col1", allow.cartesian = TRUE] %>%  # self-join on Col1
  setnames("i.Col2", "NewCol") %>%
  .[Col1 %in% single_occ, NewCol := NA] %>%      # unmatched rows get NA
  .[Col2 != NewCol | is.na(NewCol)]              # drop self-pairs, keep NA rows
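For comparison, the same self-join idea can be sketched in base R with `merge()`. This is an alternative illustration rather than the answer's own code; it assumes the question's `RowID` column is present, and the names `dat`, `pairs`, `paired`, `singles` and `res` are hypothetical.

```r
# Input table from the question, including RowID (the name `dat` is an assumption).
dat <- data.frame(
  RowID = 1:6,
  Col1 = c("apple", "orange", "apple", "cherry", "cherry", "apple"),
  Col2 = c("cow", "dog", "cat", "fish", "ant", "rat"),
  stringsAsFactors = FALSE
)

# Self-join on Col1: every row is paired with every row sharing its Col1.
pairs <- merge(dat, dat, by = "Col1", suffixes = c("", ".other"))

# Drop the pairs where a row was matched with itself.
paired <- pairs[pairs$RowID != pairs$RowID.other,
                c("RowID", "Col1", "Col2", "Col2.other")]
names(paired)[4] <- "newCol"

# Rows that lost all their pairs had no other match; add them back with NA.
singles <- dat[!dat$RowID %in% paired$RowID, ]
singles$newCol <- rep(NA_character_, nrow(singles))

res <- rbind(paired, singles)
res <- res[order(res$RowID), ]
rownames(res) <- NULL

print(res)
```

Comparing on `RowID` instead of `Col2` also keeps pairs apart when two different rows in a group happen to share the same `Col2` value.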