Question

My table looks like this:

Table1 <- data.frame(
    "Random" = c("A", "B", "C"), 
    "Genes" = c("Apple", "Candy", "Toothpaste"), 
    "Extra" = c("Up", "", "Down"), 
    "Desc" = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist")
)

which gives:

  Random      Genes Extra                       Desc
1      A      Apple    Up          Healthy,Red,Fruit
2      B      Candy       Sweet,Cavities,Sugar,Fruity
3      C Toothpaste  Down              Minty,Dentist

I have another table that contains descriptions, and I want to add a column containing the Genes. For example, Table2 would be:

Table2 <- data.frame(
    "Col1" = c(1, 2, 3, 4, 5, 6), 
    "Desc" = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity")
)

which gives:

  Col1    Desc
1    1   Sweet
2    2   Sugar
3    3 Dentist
4    4     Red
5    5   Fruit
6    6  Fruity

I want to add a column to Table2 (called "Gene" below) that matches "Desc" between the two tables and pulls in the corresponding Genes value from Table1, giving:

  Col1    Desc    Gene
1    1   Sweet    Candy
2    2   Sugar    Candy
3    3 Dentist    Toothpaste
4    4     Red    Apple
5    5   Fruit    Apple
6    6  Fruity    Candy

Answer 1

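You can try cSplit from splitstackshape to split the 'Desc' column in 'Table1' and convert the dataset from 'wide' to 'long' format. The output will be a data.table. We can then use data.table methods: set the key column to 'Desc' (setkey), join with 'Table2', and finally drop the unwanted columns from the output, either by selecting the columns to keep or by assigning (:=) the unwanted columns to NULL.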

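library(splitstackshape)
# Split Desc on commas into long format, key the result on Desc,
# join with Table2 (columns reordered so Desc comes first), then keep Col1, Desc, Genes
setkey(cSplit(Table1, 'Desc', ',', 'long'),Desc)[Table2[2:1]][
                   ,c(5,4,2), with=FALSE]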
#  Col1    Desc      Genes
#1:    1   Sweet      Candy
#2:    2   Sugar      Candy
#3:    3 Dentist Toothpaste
#4:    4     Red      Apple
#5:    5   Fruit      Apple
#6:    6  Fruity      Candy
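
For readers without splitstackshape, the same three steps (split, reshape to long, join) can be sketched in base R. This is only an illustration of the idea, not what cSplit does internally; the names terms, long, and joined are made up for the example, and the data is assumed to hold character strings rather than factors.

```r
# Example data from the question, with strings kept as character (an assumption)
Table1 <- data.frame(
    Genes = c("Apple", "Candy", "Toothpaste"),
    Desc  = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"),
    stringsAsFactors = FALSE
)
Table2 <- data.frame(
    Col1 = 1:6,
    Desc = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity"),
    stringsAsFactors = FALSE
)

# Step 1: split each Desc string into its comma-separated terms
terms <- strsplit(Table1$Desc, ",", fixed = TRUE)

# Step 2: "long" format, one row per (Genes, Desc) pair
long <- data.frame(
    Genes = rep(Table1$Genes, lengths(terms)),
    Desc  = unlist(terms),
    stringsAsFactors = FALSE
)

# Step 3: join on Desc, keeping every row of Table2, then restore Table2's order
joined <- merge(Table2, long, by = "Desc", all.x = TRUE)
joined <- joined[order(joined$Col1), c("Col1", "Desc", "Genes")]
rownames(joined) <- NULL
joined
```

merge() sorts its result by the join column, which is why the rows are reordered by Col1 afterwards.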

Answer 2

Here is a base R approach that uses an intermediate link table:

# create an intermediate data.frame with all the key (Desc) / value (Gene) pairs
df  <-  NULL
for(i in seq(nrow(Table1)))
    df  <-  rbind(df,
                  data.frame(Gene =Table1$Genes[i],
                            Desc =strsplit(as.character(Table1$Desc)[i],',')[[1]]))
df 
#>         Gene     Desc
#> 1      Apple  Healthy
#> 2      Apple      Red
#> 3      Apple    Fruit
#> 4      Candy    Sweet
#> 5      Candy Cavities
#> 6      Candy    Sugar
#> 7      Candy   Fruity
#> 8 Toothpaste    Minty
#> 9 Toothpaste  Dentist

Now link to it in the usual way:

Table2$Gene  <-  df$Gene[match(Table2$Desc,df$Desc)]
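
As a small illustration of how the match() lookup behaves, the sketch below writes the link table out literally (instead of building it with the loop) and looks up three terms. "Salty" is a hypothetical term that does not occur in Table1; it is included only to show that unmatched terms come back as NA.

```r
# Link table as printed above, written out literally; strings kept as character
df <- data.frame(
    Gene = c("Apple", "Apple", "Apple", "Candy", "Candy", "Candy", "Candy",
             "Toothpaste", "Toothpaste"),
    Desc = c("Healthy", "Red", "Fruit", "Sweet", "Cavities", "Sugar", "Fruity",
             "Minty", "Dentist"),
    stringsAsFactors = FALSE
)

# "Salty" is hypothetical and has no row in df
lookup <- c("Sweet", "Fruit", "Salty")

# match() returns the position of the first exact match, or NA if there is none
idx <- match(lookup, df$Desc)
idx
#> [1]  4  3 NA

# Indexing with NA yields NA, so unmatched terms get a missing Gene
df$Gene[idx]
#> [1] "Candy" "Apple" NA
```

Because match() compares whole strings, "Fruit" is not confused with "Fruity". If a term were listed under several genes, only the first one would be returned.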

Answer 3

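For a key lookup against a named list or two vectors (e.g., a two-column data frame), we can use the %l% function from the qdapTools package, which I maintain. First, I would use strsplit to split your Table1$Desc into a named list. That is the key. We can then look up Table2$Desc against it. This uses the data.table package on the back end, so it is quite fast: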

library(qdapTools)

key <- setNames(strsplit(as.character(Table1[["Desc"]]), "\\s*,\\s*"), Table1[["Genes"]])

## $Apple
## [1] "Healthy" "Red"     "Fruit"  
## 
## $Candy
## [1] "Sweet"    "Cavities" "Sugar"    "Fruity"  
## 
## $Toothpaste
## [1] "Minty"   "Dentist"

Table2[["Gene"]] <- Table2[["Desc"]] %l% key

##   Col1    Desc       Gene
## 1    1   Sweet      Candy
## 2    2   Sugar      Candy
## 3    3 Dentist Toothpaste
## 4    4     Red      Apple
## 5    5   Fruit      Apple
## 6    6  Fruity      Candy

Here is a pure base R vector lookup that should also be quite fast:

x <- strsplit(as.character(Table1[["Desc"]]), "\\s*,\\s*")
key <- setNames(rep(Table1[["Genes"]], sapply(x, length)), unlist(x))
Table2[["Gene"]] <- key[match(Table2[["Desc"]], names(key))]

Answer 4

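Assuming each string is unique (i.e., Fruit cannot appear under more than one Gene), you can do this fairly easily with a for loop and grep. For very large datasets, though, it may be slow.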

options(stringsAsFactors = FALSE)
Table1 <- data.frame("Random" = c("A", "B", "C"), "Genes" = c("Apple", "Candy", "Toothpaste"), "Extra" = c("Up", "", "Down"), "Desc" = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"))
Table2 <- data.frame("Col1" = c(1, 2, 3, 4, 5, 6), "Desc" = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity"))

Table2$Gene <- NA
for(x in 1:nrow(Table2)) {

    Table2[x,"Gene"] <- Table1$Genes[grep(pattern = paste("\\b",Table2$Desc[x],"\\b",sep=""),x = Table1$Desc)]
}
Table2

  Col1    Desc       Gene
1    1   Sweet      Candy
2    2   Sugar      Candy
3    3 Dentist Toothpaste
4    4     Red      Apple
5    5   Fruit      Apple
6    6  Fruity      Candy
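
The loop relies on grep returning exactly one index per term. The sketch below shows one possible way to relax that assumption; it is a variant for illustration, not part of the answer above. The helper find_genes, the extra "Pear" row, and the term "Salty" are all hypothetical and exist only to exercise the multiple-match and no-match cases.

```r
# Table1 from the question plus a hypothetical "Pear" row that also lists "Fruit"
Table1 <- data.frame(
    Genes = c("Apple", "Candy", "Toothpaste", "Pear"),
    Desc  = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity",
              "Minty,Dentist", "Green,Fruit"),
    stringsAsFactors = FALSE
)

# Hypothetical lookup terms: "Salty" matches nothing
terms <- c("Fruit", "Fruity", "Salty")

# Hypothetical helper: collapse zero, one, or many matches into a single string
find_genes <- function(term, table) {
    # \\b word boundaries keep "Fruit" from matching inside "Fruity"
    pattern <- paste0("\\b", term, "\\b")
    hits <- table$Genes[grepl(pattern, table$Desc)]
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = ";")
}

vapply(terms, find_genes, character(1), table = Table1, USE.NAMES = FALSE)
#> [1] "Apple;Pear" "Candy"      NA
```

Terms are pasted into a regular expression unescaped, so this sketch assumes they contain no regex metacharacters.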

Answer 5

Following @TylerRinker's answer, I first reformat the Table1$Desc strings:

Table1a        <- with(Table1,
                    stack(setNames(sapply(as.character(Desc),strsplit,split=","),Genes)))
names(Table1a) <- c("Desc","Genes")

Then move to data.table:

require(data.table)
DT1 <- data.table(Table1a,key="Desc")
DT2 <- data.table(Table2,key="Desc")

Then merge-n-define:

DT2[DT1,Gene:=Genes]
#    Col1    Desc       Gene
# 1:    3 Dentist Toothpaste
# 2:    5   Fruit      Apple
# 3:    6  Fruity      Candy
# 4:    4     Red      Apple
# 5:    2   Sugar      Candy
# 6:    1   Sweet      Candy

Answer 6

Assuming there aren't too many terms to match, here is an option using some tidyverse functions:

library(tidyverse)
crossing(Table1, Table2) %>% 
  mutate_if(is.factor, as.character) %>% 
  rowwise() %>% 
  filter(str_detect(Desc, Desc1)) %>% 
  select(Col1, Desc = Desc1, Genes) %>% 
  arrange(Col1)

# A tibble: 7 x 3
   Col1 Desc    Genes     
  <dbl> <chr>   <chr>     
1     1 Sweet   Candy     
2     2 Sugar   Candy     
3     3 Dentist Toothpaste
4     4 Red     Apple     
5     5 Fruit   Apple     
6     5 Fruit   Candy     
7     6 Fruity  Candy
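
Note that the result has seven rows rather than six: "Fruit" is paired with both Apple and Candy. A likely explanation is that str_detect does pattern (substring) matching, so "Fruit" is also found inside "Fruity". The base R sketch below illustrates the difference between a substring test and an exact test on the split terms; it is an illustration only and not part of the pipeline above.

```r
# Candy's description from Table1
desc <- "Sweet,Cavities,Sugar,Fruity"

# Substring match: "Fruit" is found inside "Fruity"
grepl("Fruit", desc, fixed = TRUE)
#> [1] TRUE

# Exact match against the split terms: "Fruit" is not one of them
"Fruit" %in% strsplit(desc, ",", fixed = TRUE)[[1]]
#> [1] FALSE
```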

Adding a column to a table using data from another table
