Automating grep() across multiple columns in a large dataset in R

Date: 2016-08-13 19:35:42

Tags: r regex grep

Edit: a reproducible example is at the bottom...

I am working with a large dataset (the pooled NHAMCS from the CDC):

> dim(ed0509)
[1] 174020    514

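I am having trouble using grep() to identify rows in the data frame where any of several column variables (DIAG1, DIAG2, DIAG3) matches a pattern from a vector of codes of interest, SSTI.list. The condition is that if a pattern is found in any one of those columns, I want to pull that row number and eventually use it to subset the data, so that I can create a new categorical column SSTI.cat (0 or 1) in the dataset.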

SSTI.list <- c("035", "566", "60883", "6110", "6752", "6751", "680","681","682","683","684","684","685","686", "7048", "70583","7070", "7078", "7079", "7071", "7280", "72886", "7714", "7715", "7854", "9583", "99662", "99762", "9985")

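Since the real list is very long (> 1000 elements), I am trying to automate the process with a for loop. The desired output is a set of new variables, each holding the list of matching rows for one value in the vector SSTI.list. My main problem is running grep() inside the for loop, where I get this error: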

argument 'pattern' has length > 1 and only the first element will be used

What I have tried so far:

diags <- c(ed0509$DIAG1,ed0509$DIAG2,ed0509$DIAG3)

for (i in SSTI.list) {
  assign(paste("var", i, sep = ""), grep(paste("^", i, "", sep = ""), diags, value = FALSE))
}

SSTI.comb would be the final list of rows (all of the var i vectors) that the for loop identified as matching the patterns in SSTI.list, and it would be used to create the categorical variable SSTI.cat:

SSTI.comb<-sort(as.numeric(SSTI.comb))

I would then create the categorical variable using the data.table package:

setDT(ed0509)[SSTI.comb,SSTI.cat:=1][,SSTI.cat:=0]

Edit for reproducibility, sorry...

DIAG1=c("00000","4659-","0356-","5664-","771--","7715-","78791")
DIAG2=c("3829-","00000","00000","4659-","7854-","00000","566--")
DIAG3=c("9985-","00000","00000","00000","00000","00000","00000")
df<-data.frame(DIAG1,DIAG2,DIAG3)
SSTI.list <- c("035","9985","7854","771","7715")
for (i in SSTI.list){
  assign(paste("var",i,sep=""),grep(paste("^",i,"",sep=""),diags,value=F))
}

Conceptually, I would like an output in which a new column variable appended to df indicates that rows 1, 3, 5, and 6 were identified as matching the patterns in SSTI.list.

1 Answer:

Answer 0 (score: 1)

Here is an example with fake data that I wrote before you added your data. Let me know if this is what you had in mind:

SSTI.list <- c("035", "566", "60883", "6110", "6752", "6751", "680","681","682","683","684","684",
               "685","686", "7048", "70583","7070", "7078", "7079", "7071", "7280", "72886", 
               "7714", "7715", "7854", "9583", "99662", "99762", "9985")

# Fake data
set.seed(10)
dat = as.data.frame(replicate(5, sample(c(SSTI.list, 1e5:(1e5+1000)),10)), stringsAsFactors=FALSE)
       V1     V2     V3     V4     V5
1  100493 100642 100861 100522 100254
2  100286 100555 100604 100066 100206
3  100409 100087 100767 100145   7048
4  100682 100583 100336 100895 100719
5  100058 100338 100387 100404 100227
6  100202 100410 100695 100737 100136
7  100252 100024 100829 100813   7078
8  100249 100241 100216 100947 100468
9  100600 100378 100758 100671 100076
10 100998 100824 100334 100482 100789
# Match any instance of a pattern within any element of the data
dat[apply(dat, 1, function(i) any(grepl(paste(SSTI.list, collapse="|"), i))),]
      V1     V2     V3     V4     V5
3 100409 100087 100767 100145   7048
4 100682 100583 100336 100895 100719  # "100682" matches "682" in SSTI.list
7 100252 100024 100829 100813   7078
# Match only if a data element is exactly the same as one of the patterns.
dat[apply(dat, 1, function(i) any(grepl(paste(paste0("^",SSTI.list,"$"), collapse="|"), i))),]
      V1     V2     V3     V4   V5
3 100409 100087 100767 100145 7048
7 100252 100024 100829 100813 7078
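
Between these two extremes sits the question's own attempt, which anchors each code only at the start (`^`). That may suit values such as "7715-" that carry trailing padding. Below is a minimal sketch of that prefix variant, applied to the small reproducible data frame from the question. The names `prefix.pattern` and `prefix.rows` are introduced here for illustration only.

```r
# Reproducible data from the question
DIAG1 <- c("00000", "4659-", "0356-", "5664-", "771--", "7715-", "78791")
DIAG2 <- c("3829-", "00000", "00000", "4659-", "7854-", "00000", "566--")
DIAG3 <- c("9985-", "00000", "00000", "00000", "00000", "00000", "00000")
df <- data.frame(DIAG1, DIAG2, DIAG3, stringsAsFactors = FALSE)
SSTI.list <- c("035", "9985", "7854", "771", "7715")

# One regex that matches any code, but only at the start of a value
# (assumes the codes contain no regex metacharacters, as is the case for digits)
prefix.pattern <- paste0("^(", paste(SSTI.list, collapse = "|"), ")")

# Row-wise check, same shape as the apply() calls above
prefix.rows <- which(apply(df, 1, function(i) any(grepl(prefix.pattern, i))))
print(prefix.rows)
```

On this data the matching rows are 1, 3, 5 and 6, which are the rows the question expects.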

If you only want the row indices of the matching rows:

which(apply(dat, 1, function(i) any(grepl(paste(SSTI.list, collapse="|"), i))))

[1] 3 4 7
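
With 174,020 rows, calling `grepl()` once per row through `apply()` can be slow. A possible alternative is to call `grepl()` once per column, combine the results, and store the flag as the 0/1 column `SSTI.cat` that the question asks for. The sketch below assumes that only the three diagnosis columns need searching and that a prefix match is wanted. The names `diag.cols`, `pattern`, `hits.by.col`, and `hit` are illustrative and not from the original post.

```r
# Reproducible data from the question
DIAG1 <- c("00000", "4659-", "0356-", "5664-", "771--", "7715-", "78791")
DIAG2 <- c("3829-", "00000", "00000", "4659-", "7854-", "00000", "566--")
DIAG3 <- c("9985-", "00000", "00000", "00000", "00000", "00000", "00000")
df <- data.frame(DIAG1, DIAG2, DIAG3, stringsAsFactors = FALSE)
SSTI.list <- c("035", "9985", "7854", "771", "7715")

# Columns to search (assumption: only the diagnosis columns matter)
diag.cols <- c("DIAG1", "DIAG2", "DIAG3")

# Single prefix-anchored pattern covering every code
pattern <- paste0("^(", paste(SSTI.list, collapse = "|"), ")")

# grepl() is vectorised over its input, so one call handles a whole column
hits.by.col <- lapply(df[diag.cols], function(col) grepl(pattern, col))

# A row is a hit if any of its columns matched
hit <- Reduce(`|`, hits.by.col)

# 0/1 indicator column
df$SSTI.cat <- as.integer(hit)
print(df)
```

Because the indicator is built from a logical vector of length `nrow(df)`, the row positions always line up with the data frame. Indices taken from a concatenated vector such as `diags` would not.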