Using grepl

Date: 2019-03-01 20:01:26

Tags: r for-loop filter grepl

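I have a problem: I want to filter the columns "Pair_1" through "Pair_4" of the data frame "data01" against every character pattern listed in "pexl07".

The data frame data01 looks like this: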

               Pair_1                     Pair_2                     Pair_3                   Pair_4
453  lupinespringcereal       grasscloverleyquinoa   springcerealspringcereal         camelinacamelina
1073 lupinespringcereal     grasscloverleycamelina   springcerealspringcereal             quinoaquinoa
1330 lupinespringcereal     grasscloverleycamelina         quinoaspringcereal             lupinequinoa
1373 lupinespringcereal     grasscloverleycamelina               quinoaquinoa       lupinespringcereal
1698     lupinecamelina grasscloverleyspringcereal               quinoaquinoa springcerealspringcereal
1910 lupinespringcereal       springcerealcamelina grasscloverleyspringcereal             lupinequinoa
1947 lupinespringcereal       springcerealcamelina       grasscloverleyquinoa       lupinespringcereal
1979 lupinespringcereal         springcerealquinoa grasscloverleyspringcereal           lupinecamelina
2141       lupinequinoa   springcerealspringcereal     grasscloverleycamelina       lupinespringcereal
2745     lupinecamelina   springcerealspringcereal       grasscloverleyquinoa springcerealspringcereal

pexl07 looks like this (shortened for the example):

                       V1
1             quinoaquinoa
2 springcerealspringcereal

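I have tried many different approaches using for(), filter(), subset(), grepl.sub() and grepl(), but I cannot get any of them to work, probably because I do not understand how to index with loops. Options without a loop are welcome too.

This part works for a single column and a single pattern: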

data02 <- filter(data01, !grepl(paste(pexl07[1, 1]), paste(data01[, 1])))

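But how can I make this work automatically for all expressions in pexl07 and across all columns of data01?

I tried some variations, but none of them returned what I want: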

for (j in ncol(data01))  {
  for (i in 1:nrow(pexl07)) {
    data02 <- filter(data01,
                         !grepl(paste(pexl07[j, ]), paste(data01[ ,i]))) 
  } 
} 

To be clear, I would like it to end up like this:

                 Pair_1                 Pair_2                     Pair_3             Pair_4
1330 lupinespringcereal grasscloverleycamelina         quinoaspringcereal       lupinequinoa
1910 lupinespringcereal   springcerealcamelina grasscloverleyspringcereal       lupinequinoa
1947 lupinespringcereal   springcerealcamelina       grasscloverleyquinoa lupinespringcereal
1979 lupinespringcereal     springcerealquinoa grasscloverleyspringcereal     lupinecamelina

With dput:

structure(list(Pair_1 = structure(c(6L, 6L, 6L, 6L), .Label = c("grasscloverleycamelina", 
"grasscloverleyquinoa", "lupinecamelina", "lupinegrasscloverley", 
"lupinequinoa", "lupinespringcereal"), class = "factor"), Pair_2 = structure(c(3L, 
9L, 9L, 11L), .Label = c("camelinacamelina", "camelinagrasscloverley", 
"grasscloverleycamelina", "grasscloverleyquinoa", "grasscloverleyspringcereal", 
"quinoagrasscloverley", "quinoaquinoa", "quinoaspringcereal", 
"springcerealcamelina", "springcerealgrasscloverley", "springcerealquinoa", 
"springcerealspringcereal"), class = "factor"), Pair_3 = structure(c(11L, 
7L, 6L, 7L), .Label = c("camelinacamelina", "camelinagrasscloverley", 
"camelinaquinoa", "camelinaspringcereal", "grasscloverleycamelina", 
"grasscloverleyquinoa", "grasscloverleyspringcereal", "quinoacamelina", 
"quinoagrasscloverley", "quinoaquinoa", "quinoaspringcereal", 
"springcerealcamelina", "springcerealquinoa", "springcerealspringcereal"
), class = "factor"), Pair_4 = structure(c(6L, 6L, 7L, 5L), .Label = c("camelinacamelina", 
"camelinagrasscloverley", "grasscloverleycamelina", "grasscloverleyspringcereal", 
"lupinecamelina", "lupinequinoa", "lupinespringcereal", "quinoagrasscloverley", 
"quinoaquinoa", "quinoaspringcereal", "springcerealcamelina", 
"springcerealquinoa", "springcerealspringcereal"), class = "factor")), row.names = c(1330L, 
1910L, 1947L, 1979L), class = "data.frame")

dput pexl07:

structure(list(V1 = structure(1:2, .Label = c("quinoaquinoa", 
"springcerealspringcereal"), class = "factor")), row.names = 1:2, class = "data.frame")

dput data01:

  structure(list(Pair_1 = structure(c(6L, 6L, 6L, 6L, 3L, 6L), .Label = c("grasscloverleycamelina", 
    "grasscloverleyquinoa", "lupinecamelina", "lupinegrasscloverley", 
    "lupinequinoa", "lupinespringcereal"), class = "factor"), Pair_2 = structure(c(4L, 
    3L, 3L, 3L, 5L, 9L), .Label = c("camelinacamelina", "camelinagrasscloverley", 
    "grasscloverleycamelina", "grasscloverleyquinoa", "grasscloverleyspringcereal", 
    "quinoagrasscloverley", "quinoaquinoa", "quinoaspringcereal", 
    "springcerealcamelina", "springcerealgrasscloverley", "springcerealquinoa", 
    "springcerealspringcereal"), class = "factor"), Pair_3 = structure(c(14L, 
    14L, 11L, 10L, 10L, 7L), .Label = c("camelinacamelina", "camelinagrasscloverley", 
    "camelinaquinoa", "camelinaspringcereal", "grasscloverleycamelina", 
    "grasscloverleyquinoa", "grasscloverleyspringcereal", "quinoacamelina", 
    "quinoagrasscloverley", "quinoaquinoa", "quinoaspringcereal", 
    "springcerealcamelina", "springcerealquinoa", "springcerealspringcereal"
    ), class = "factor"), Pair_4 = structure(c(1L, 9L, 6L, 7L, 13L, 
    6L), .Label = c("camelinacamelina", "camelinagrasscloverley", 
    "grasscloverleycamelina", "grasscloverleyspringcereal", "lupinecamelina", 
    "lupinequinoa", "lupinespringcereal", "quinoagrasscloverley", 
    "quinoaquinoa", "quinoaspringcereal", "springcerealcamelina", 
    "springcerealquinoa", "springcerealspringcereal"), class = "factor")), row.names = c(453L, 
    1073L, 1330L, 1373L, 1698L, 1910L), class = "data.frame")

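1 answer:

Answer 0: (score: 0)

Updated my answer. If I understand you correctly now, you want to remove the matching observations. In R, missing values are represented by NA. Rather than storing what you want to remove in a data frame, store it in a vector, which is easier to use in a filter.

If you want to remove the whole row, let me know; that would need a slightly different solution.

I am using the tidyverse to achieve what you want. The code is below: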

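# convert pexl07 to a character vector
pexl07 <- as.character(pexl07$V1)
library(dplyr)   # provides %>%, group_by(), mutate(), row_number()
library(tidyr)   # provides gather(), spread()
data01 %>%
  mutate_all(as.character) %>%                  # factors -> character before reshaping
  gather(pair, cereal) %>%                      # long format: one row per cell
  group_by(pair) %>%
  mutate(index = row_number()) %>%              # remember each cell's original row
  mutate(cereal = ifelse(cereal %in% pexl07, NA, cereal)) %>%
  spread(pair, cereal)                          # back to wide format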
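For the whole-row variant mentioned above, a minimal base R sketch could look like the following. It assumes a cell counts as a hit only when it equals one of the pexl07 entries exactly. The toy data is retyped from the dput output in the question, and `patterns`, `hits` and `drop_row` are illustrative names that do not appear in the original post.

```r
# Toy data retyped from the dput(data01) output in the question.
data01 <- data.frame(
  Pair_1 = c("lupinespringcereal", "lupinespringcereal", "lupinespringcereal",
             "lupinespringcereal", "lupinecamelina", "lupinespringcereal"),
  Pair_2 = c("grasscloverleyquinoa", "grasscloverleycamelina",
             "grasscloverleycamelina", "grasscloverleycamelina",
             "grasscloverleyspringcereal", "springcerealcamelina"),
  Pair_3 = c("springcerealspringcereal", "springcerealspringcereal",
             "quinoaspringcereal", "quinoaquinoa", "quinoaquinoa",
             "grasscloverleyspringcereal"),
  Pair_4 = c("camelinacamelina", "quinoaquinoa", "lupinequinoa",
             "lupinespringcereal", "springcerealspringcereal", "lupinequinoa"),
  row.names = c(453L, 1073L, 1330L, 1373L, 1698L, 1910L),
  stringsAsFactors = TRUE
)
pexl07 <- data.frame(V1 = c("quinoaquinoa", "springcerealspringcereal"),
                     stringsAsFactors = TRUE)

# Work with plain strings rather than factors.
patterns <- as.character(pexl07$V1)

# One logical vector per column: TRUE where the cell equals a pattern.
hits <- lapply(data01, function(col) as.character(col) %in% patterns)

# A row is dropped as soon as any of its columns matched.
drop_row <- Reduce(`|`, hits)
data02 <- data01[!drop_row, ]
print(data02)
```

On this toy data only rows 1330 and 1910 remain. Because `%in%` compares whole strings, a longer value that merely contains a pattern is kept; if substring matches should count as well, grepl would be the tool instead.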

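You are not really filtering; you are filtering out values by replacing them with blanks. So I replace with "" (blank) every value in the data frame's columns that matches one of the expressions in pexl07. This is done with gsub and regular expressions (regex); read ?gsub. I am using sapply, which applies the function to every column.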

    sapply(data01, function(col) gsub("quinoaquinoa|springcerealspringcereal", "", col))
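One thing to be aware of: on a data frame like this, sapply simplifies its result to a character matrix. If a data frame is wanted back, a sketch along these lines may help. It also builds the regex from pexl07 instead of typing it out. The two-column toy data and the names `pattern` and `cleaned` are assumptions for illustration only.

```r
# Small toy subset of the question's data (two columns, three rows).
data01 <- data.frame(
  Pair_3 = c("springcerealspringcereal", "quinoaspringcereal", "quinoaquinoa"),
  Pair_4 = c("camelinacamelina", "lupinequinoa", "lupinespringcereal"),
  stringsAsFactors = TRUE
)
pexl07 <- data.frame(V1 = c("quinoaquinoa", "springcerealspringcereal"),
                     stringsAsFactors = TRUE)

# Join all patterns into a single alternation: "a|b".
pattern <- paste(as.character(pexl07$V1), collapse = "|")

# Assigning into cleaned[] keeps the data.frame structure;
# lapply returns one cleaned character vector per column.
cleaned <- data01
cleaned[] <- lapply(data01, function(col) gsub(pattern, "", as.character(col)))
print(cleaned)
```

The pattern is not anchored here, so it would also blank out a match inside a longer string; wrapping it as `paste0("^(", pattern, ")$")` is one way to restrict it to whole values.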

Or, if you are using dplyr, the . stands for the column, and mutate_all mutates every column.

    # dplyr version
    data01 %>%
      mutate_all(funs(gsub("quinoaquinoa|springcerealspringcereal", "", .)))
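As for the nested loop in the question, three things seem to keep it from working: `for (j in ncol(data01))` runs only once (with j equal to the number of columns), the indices i and j are swapped between pexl07 and data01, and data02 is rebuilt from the full data01 on every pass, so earlier removals are lost. A hedged base R sketch of a repaired loop, on a small toy subset of the question's data, might look like this; `pattern` and `keep` are illustrative names.

```r
# Toy subset: rows 1073, 1330, 1373, 1910 and columns Pair_3, Pair_4.
data01 <- data.frame(
  Pair_3 = c("springcerealspringcereal", "quinoaspringcereal",
             "quinoaquinoa", "grasscloverleyspringcereal"),
  Pair_4 = c("quinoaquinoa", "lupinequinoa",
             "lupinespringcereal", "lupinequinoa"),
  row.names = c(1073L, 1330L, 1373L, 1910L),
  stringsAsFactors = TRUE
)
pexl07 <- data.frame(V1 = c("quinoaquinoa", "springcerealspringcereal"),
                     stringsAsFactors = TRUE)

data02 <- data01                          # start from the full data, then shrink it
for (j in seq_len(ncol(data01))) {        # j walks over the columns
  for (i in seq_len(nrow(pexl07))) {      # i walks over the patterns
    pattern <- as.character(pexl07[i, 1])
    # fixed = TRUE treats the pattern as a literal substring, not a regex
    keep <- !grepl(pattern, as.character(data02[, j]), fixed = TRUE)
    data02 <- data02[keep, , drop = FALSE]
  }
}
print(data02)
```

Filtering data02 (not data01) inside the loop is what lets the removals accumulate across columns and patterns.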