My table looks like this:
Table1 <- data.frame(
"Random" = c("A", "B", "C"),
"Genes" = c("Apple", "Candy", "Toothpaste"),
"Extra" = c("Up", "", "Down"),
"Desc" = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist")
)
which gives:
Random Genes Extra Desc
1 A Apple Up Healthy,Red,Fruit
2 B Candy Sweet,Cavities,Sugar,Fruity
3 C Toothpaste Down Minty,Dentist
I have another table that contains descriptions, and I want to add a column to it containing the Genes. For example, Table2 would be:
Table2 <- data.frame(
"Col1" = c(1, 2, 3, 4, 5, 6),
"Desc" = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity")
)
which gives:
Col1 Desc
1 1 Sweet
2 2 Sugar
3 3 Dentist
4 4 Red
5 5 Fruit
6 6 Fruity
I want to add a column named "Genes" to Table2 by matching on "Desc" across the two tables and pulling in the Genes from Table1, to get:
Col1 Desc Gene
1 1 Sweet Candy
2 2 Sugar Candy
3 3 Dentist Toothpaste
4 4 Red Apple
5 5 Fruit Apple
6 6 Fruity Candy
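Answer 0 (score: 8)
You can try cSplit from splitstackshape to split the 'Desc' column in 'Table1' and convert the dataset from 'wide' to 'long' format. The output will be a data.table. We can then use data.table methods to set the key column to 'Desc' (setkey), join with 'Table2', and finally drop the unwanted columns from the output, either by selecting the columns to keep or by assigning (:=) the unwanted columns to NULL.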
library(splitstackshape)
setkey(cSplit(Table1, 'Desc', ',', 'long'),Desc)[Table2[2:1]][
,c(5,4,2), with=FALSE]
# Col1 Desc Genes
#1: 1 Sweet Candy
#2: 2 Sugar Candy
#3: 3 Dentist Toothpaste
#4: 4 Red Apple
#5: 5 Fruit Apple
#6: 6 Fruity Candy
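The one-liner above packs several steps together. As a rough sketch, the same idea can be unpacked into named steps. This variant swaps the keyed join for merge() and selects columns by name rather than by position, so it is an illustration of the approach and not the answer's exact code; the names long and result are made up for this example, and it assumes both data.table and splitstackshape are installed.

```r
library(data.table)
library(splitstackshape)

Table1 <- data.frame(
  Random = c("A", "B", "C"),
  Genes  = c("Apple", "Candy", "Toothpaste"),
  Extra  = c("Up", "", "Down"),
  Desc   = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"),
  stringsAsFactors = FALSE
)
Table2 <- data.frame(
  Col1 = c(1, 2, 3, 4, 5, 6),
  Desc = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity"),
  stringsAsFactors = FALSE
)

# Step 1: one row per (Genes, Desc term) pair; the result is a data.table
long <- cSplit(Table1, "Desc", sep = ",", direction = "long")

# Step 2: keep every row of Table2 and attach the matching Genes by Desc
result <- merge(as.data.table(Table2), long[, .(Desc, Genes)],
                by = "Desc", all.x = TRUE)

# Step 3: merge() sorts by the join column, so restore the original order
setorder(result, Col1)
setcolorder(result, c("Col1", "Desc", "Genes"))
result
```

Selecting by name avoids depending on the column positions (5, 4, 2) that the one-liner relies on.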
Answer 1 (score: 5)
Here is a way to do it in base R, using an intermediate link table:
# create an intermediate data.frame with all the key (Desc) / value (Gene) pairs
df <- NULL
for (i in seq(nrow(Table1)))
  df <- rbind(df,
              data.frame(Gene = Table1$Genes[i],
                         Desc = strsplit(as.character(Table1$Desc)[i], ',')[[1]]))
df
#> Gene Desc
#> 1 Apple Healthy
#> 2 Apple Red
#> 3 Apple Fruit
#> 4 Candy Sweet
#> 5 Candy Cavities
#> 6 Candy Sugar
#> 7 Candy Fruity
#> 8 Toothpaste Minty
#> 9 Toothpaste Dentist
Now link to it in the usual way:
Table2$Gene <- df$Gene[match(Table2$Desc,df$Desc)]
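For larger tables, growing a data.frame with rbind inside a loop can get slow. The sketch below builds the same kind of link table without a loop and shows how match() behaves when a term has no counterpart. The names terms, lookup, query and genes are chosen for illustration, and "Salty" is a made-up term that does not appear in the question's data.

```r
Table1 <- data.frame(
  Genes = c("Apple", "Candy", "Toothpaste"),
  Desc  = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"),
  stringsAsFactors = FALSE
)

# Split each Desc string into its individual terms (a list, one element per row)
terms <- strsplit(Table1$Desc, ",", fixed = TRUE)

# Repeat each gene once per term so the two columns line up
lookup <- data.frame(
  Gene = rep(Table1$Genes, lengths(terms)),
  Desc = unlist(terms),
  stringsAsFactors = FALSE
)

# match() returns the position of the first exact match, or NA if there is none
query <- c("Sweet", "Red", "Salty")  # "Salty" is hypothetical, used to show NA
genes <- lookup$Gene[match(query, lookup$Desc)]
genes
# [1] "Candy" "Apple" NA
```

Because match() only reports the first hit, a term listed under several genes would silently get the first one.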
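Answer 2 (score: 4)
If we can do a key lookup on either a named list or 2 vectors (e.g., a 2-column data frame), we can use the %l% function from the qdapTools package, which I maintain. First, I split your Table1$Desc into a named list using strsplit. That is the key. We can then do the lookup via Table2$Desc. This uses the data.table package on the back end, so it is pretty fast: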
library(qdapTools)
key <- setNames(strsplit(as.character(Table1[["Desc"]]), "\\s*,\\s*"), Table1[["Genes"]])
## $Apple
## [1] "Healthy" "Red" "Fruit"
##
## $Candy
## [1] "Sweet" "Cavities" "Sugar" "Fruity"
##
## $Toothpaste
## [1] "Minty" "Dentist"
Table2[["Gene"]] <- Table2[["Desc"]] %l% key
## Col1 Desc Gene
## 1 1 Sweet Candy
## 2 2 Sugar Candy
## 3 3 Dentist Toothpaste
## 4 4 Red Apple
## 5 5 Fruit Apple
## 6 6 Fruity Candy
Here is a pure base R vector lookup, which should also be very fast:
x <- strsplit(as.character(Table1[["Desc"]]), "\\s*,\\s*")
key <- setNames(rep(Table1[["Genes"]], sapply(x, length)), unlist(x))
Table2[["Gene"]] <- key[match(Table2[["Desc"]], names(key))]
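Answer 3 (score: 3)
Assuming each string is unique (i.e., Fruit cannot appear under more than one Gene), you can do this fairly easily with a for loop and grep. It may be slow on huge datasets, though.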
options(stringsAsFactors = FALSE)
Table1 <- data.frame("Random" = c("A", "B", "C"), "Genes" = c("Apple", "Candy", "Toothpaste"), "Extra" = c("Up", "", "Down"), "Desc" = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"))
Table2 <- data.frame("Col1" = c(1, 2, 3, 4, 5, 6), "Desc" = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity"))
Table2$Gene <- NA
for (x in 1:nrow(Table2)) {
  Table2[x, "Gene"] <- Table1$Genes[grep(pattern = paste("\\b", Table2$Desc[x], "\\b", sep = ""), x = Table1$Desc)]
}
Table2
Col1 Desc Gene
1 1 Sweet Candy
2 2 Sugar Candy
3 3 Dentist Toothpaste
4 4 Red Apple
5 5 Fruit Apple
6 6 Fruity Candy
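The loop above relies on every term matching exactly one row of Table1. As a hedged sketch of how the uniqueness assumption could be relaxed, the helper below returns NA when nothing matches and joins multiple genes with ";" when several rows match. find_gene is a hypothetical helper written for this example, "Salty" is a made-up term, and the code assumes the terms are plain words with no regex metacharacters.

```r
Table1 <- data.frame(
  Genes = c("Apple", "Candy", "Toothpaste"),
  Desc  = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"),
  stringsAsFactors = FALSE
)
Table2 <- data.frame(
  Col1 = c(1, 2, 3, 4),
  Desc = c("Sweet", "Fruit", "Fruity", "Salty"),  # "Salty" is hypothetical
  stringsAsFactors = FALSE
)

# Hypothetical helper: whole-word search for one term across all Desc strings
find_gene <- function(term, table1) {
  pattern <- paste0("\\b", term, "\\b")
  hits <- grep(pattern, table1$Desc)
  if (length(hits) == 0) return(NA_character_)  # no gene lists this term
  paste(table1$Genes[hits], collapse = ";")     # one or several genes
}

Table2$Gene <- vapply(Table2$Desc, find_gene, character(1),
                      table1 = Table1, USE.NAMES = FALSE)
Table2$Gene
# [1] "Candy" "Apple" "Candy" NA
```

The \\b word boundaries are what keep Fruit from matching inside Fruity.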
Answer 4 (score: 3)
Following @TylerRinker's answer, I would first reshape the Table1$Desc strings:
Table1a <- with(Table1,
  stack(setNames(sapply(as.character(Desc), strsplit, split = ","), Genes)))
names(Table1a) <- c("Desc","Genes")
Then move to data.table:
require(data.table)
DT1 <- data.table(Table1a,key="Desc")
DT2 <- data.table(Table2,key="Desc")
Then merge-n-define:
DT2[DT1,Gene:=Genes]
# Col1 Desc Gene
# 1: 3 Dentist Toothpaste
# 2: 5 Fruit Apple
# 3: 6 Fruity Candy
# 4: 4 Red Apple
# 5: 2 Sugar Candy
# 6: 1 Sweet Candy
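Answer 5 (score: 0)
Assuming there are not too many terms to match, here is an option using some tidyverse functions. Note that str_detect matches substrings, so Fruit also matches the Fruity in Candy's description, which is why the output has 7 rows rather than 6: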
library(tidyverse)
crossing(Table1, Table2) %>%
mutate_if(is.factor, as.character) %>%
rowwise() %>%
filter(str_detect(Desc, Desc1)) %>%
select(Col1, Desc = Desc1, Genes) %>%
arrange(Col1)
# A tibble: 7 x 3
Col1 Desc Genes
<dbl> <chr> <chr>
1 1 Sweet Candy
2 2 Sugar Candy
3 3 Dentist Toothpaste
4 4 Red Apple
5 5 Fruit Apple
6 5 Fruit Candy
7 6 Fruity Candy
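If exact term matches are wanted instead of substring matches, one possible tidyverse alternative is to split Table1's Desc into one row per term and then join on it. This is a sketch of a different technique (separate_rows plus left_join), not a fix of the code above; lookup and result are names chosen for illustration, and it assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

Table1 <- data.frame(
  Random = c("A", "B", "C"),
  Genes  = c("Apple", "Candy", "Toothpaste"),
  Extra  = c("Up", "", "Down"),
  Desc   = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"),
  stringsAsFactors = FALSE
)
Table2 <- data.frame(
  Col1 = c(1, 2, 3, 4, 5, 6),
  Desc = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity"),
  stringsAsFactors = FALSE
)

# One row per (Genes, Desc term) pair
lookup <- Table1 %>%
  separate_rows(Desc, sep = ",") %>%
  select(Desc, Genes)

# Exact match on Desc; rows of Table2 with no match would get NA in Genes
result <- Table2 %>%
  left_join(lookup, by = "Desc")
result
```

Since the join compares whole terms, Fruit pairs only with Apple and the result keeps one row per row of Table2 for this data.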