Question

I hope this question isn't too easy for this board.

I have created a data.frame df:

       CAS        Name        CID
89  13010-47-4   Lomustine         3950
90 130209-82-4   Latanoprost       5311221,5282380,46705340,3890
91 130636-43-0   Nifekalant        268083
92 130929-57-6   Entacapone        5281081

and a vector vec:

[1] 5282380 18471829 45923789 44308022 44266812 24883465 24867475 24867460

I want to extract the rows of df that contain any of the numbers in vec. I tried to solve this with the following code:

 df$GC[(df$CID %in% vec)] = 1

 df[df$GC==1,]

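The problem with this solution is that I only get the rows whose CID column contains a single number. Rows with multiple values in CID, such as row 90, do not show up.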
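For illustration, the mismatch can be reproduced with a minimal sketch. It rebuilds two of the rows by hand and adds 3950 to a shortened vec (an assumption, so that at least one single-ID row matches):

```r
# Minimal reproduction; the data frame is rebuilt by hand for illustration.
df <- data.frame(
  CAS  = c("13010-47-4", "130209-82-4"),
  Name = c("Lomustine", "Latanoprost"),
  CID  = c("3950", "5311221,5282380,46705340,3890"),
  stringsAsFactors = FALSE
)

# Assumption: 3950 is added so that one single-ID row matches.
vec <- c(5282380, 18471829, 3950)

# %in% compares each whole CID string against vec (coerced to character),
# so the comma-separated string in row 2 never equals "5282380".
hits <- df$CID %in% vec
hits
```

Only the first element comes back TRUE, because the second CID is one long string rather than four separate numbers.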

Is there an elegant solution to this problem?

Thanks in advance

Answer 1

One approach is to use grep():

> txt <- "       CAS        Name        CID
+   13010-47-4   Lomustine         3950
+  130209-82-4   Latanoprost       5311221,5282380,46705340,3890
+  130636-43-0   Nifekalant        268083
+  130929-57-6   Entacapone        5281081
+ "
> con <- textConnection(txt)
> df <- read.table(con, header = TRUE)
> close(con)
> ID <- c(5282380, 18471829, 45923789, 44308022, 44266812, 24883465, 24867475, 24867460, 3950)
> grep(paste("\\b", ID, "\\b", sep="", collapse = "|"), df$CID)
[1] 1 2
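
To get the rows themselves rather than their indices, the same pattern can be passed to grepl(). The following is only a sketch: the shortened ID vector and the names pattern and matched are chosen here for illustration, and it assumes CID is read as character.

```r
# Rebuild the example data; CID is kept as character.
df <- read.table(text = "CAS Name CID
13010-47-4 Lomustine 3950
130209-82-4 Latanoprost 5311221,5282380,46705340,3890
130636-43-0 Nifekalant 268083
130929-57-6 Entacapone 5281081", header = TRUE, stringsAsFactors = FALSE)

# Shortened ID vector (assumption, for brevity).
ID <- c(5282380, 18471829, 3950)

# One alternation of word-bounded IDs: \b5282380\b|\b18471829\b|\b3950\b
pattern <- paste0("\\b", ID, "\\b", collapse = "|")

# grepl() returns a logical vector, which can index rows directly.
matched <- df[grepl(pattern, df$CID), ]
matched

# The \b anchors matter: without them, "3950" also matches inside "139501".
grepl("3950", "139501")        # TRUE
grepl("\\b3950\\b", "139501")  # FALSE
```

The word boundaries keep a short ID from matching as a substring of a longer one, which is the main risk of a regex-based approach here.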

Answer 2

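Given your comment on EDi's answer (which I like), I thought I'd offer a suggestion.

Cramming comma-separated values into a single column of a data frame is awkward and (in my experience) only leads to frustration. I often find it simpler to keep them in a separate data structure, a list: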

dat <- read.table(text = "       CAS        Name        CID
   13010-47-4   Lomustine         3950
  130209-82-4   Latanoprost       5311221,5282380,46705340,3890
  130636-43-0   Nifekalant        268083
  130929-57-6   Entacapone        5281081",sep = "",header = TRUE)

cid <- sapply(dat$CID,strsplit,",",USE.NAMES = FALSE)

In this form, things are often easier to work with:

ID <- c(5282380, 18471829, 45923789, 44308022, 44266812, 24883465, 24867475, 24867460, 3950)
dat[sapply(cid,function(x) {any(x %in% as.character(ID))}),]
          CAS        Name                           CID
1  13010-47-4   Lomustine                          3950
2 130209-82-4 Latanoprost 5311221,5282380,46705340,3890

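If you're worried about the order changing, you can always use the rownames of dat and the names of the list to keep each item matched to the right row.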
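
A rough sketch of that idea, assuming CID is read as character. The reordering by Name and the names shuffled and keep are only there for illustration:

```r
dat <- read.table(text = "CAS Name CID
13010-47-4 Lomustine 3950
130209-82-4 Latanoprost 5311221,5282380,46705340,3890
130636-43-0 Nifekalant 268083
130929-57-6 Entacapone 5281081", header = TRUE, stringsAsFactors = FALSE)

# Split CID into a list and label each element with its source row name.
cid <- strsplit(dat$CID, ",")
names(cid) <- rownames(dat)

ID <- c(5282380, 18471829, 3950)

# Reorder the data frame; the list itself is left untouched.
shuffled <- dat[order(dat$Name), ]

# Look up each row's IDs by row name rather than by position.
keep <- vapply(cid[rownames(shuffled)],
               function(x) any(x %in% as.character(ID)),
               logical(1))
shuffled[keep, ]
```

Because the lookup goes through the names, the list and the data frame stay in sync even after dat has been sorted or subset.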

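(Also note that my anonymous function assumes that ID will eventually be found via R's scoping rules; you can change the function to pass ID explicitly if you prefer.)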
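
One possible way to pass ID explicitly, as a sketch. The helper name has_any_id and its ids argument are hypothetical, and the list is written out by hand here:

```r
# Hypothetical helper: takes the lookup IDs as an argument instead of
# relying on a global ID being found through lexical scoping.
has_any_id <- function(x, ids) any(x %in% as.character(ids))

# The split CID values, written out by hand for this sketch.
cid <- list("3950",
            c("5311221", "5282380", "46705340", "3890"),
            "268083",
            "5281081")

ID <- c(5282380, 18471829, 3950)

# Extra arguments after FUN.VALUE are forwarded to has_any_id().
hit <- vapply(cid, has_any_id, logical(1), ids = ID)
hit
```

vapply() is used instead of sapply() so that the result is guaranteed to be a logical vector with one element per list item.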

data.frame slicing
