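I have a question: I want to filter the columns "Pair_1" through "Pair_4" of the data frame "data01" against each of the character patterns listed in "pexl07".
The data frame data01 looks like this: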
Pair_1 Pair_2 Pair_3 Pair_4
453 lupinespringcereal grasscloverleyquinoa springcerealspringcereal camelinacamelina
1073 lupinespringcereal grasscloverleycamelina springcerealspringcereal quinoaquinoa
1330 lupinespringcereal grasscloverleycamelina quinoaspringcereal lupinequinoa
1373 lupinespringcereal grasscloverleycamelina quinoaquinoa lupinespringcereal
1698 lupinecamelina grasscloverleyspringcereal quinoaquinoa springcerealspringcereal
1910 lupinespringcereal springcerealcamelina grasscloverleyspringcereal lupinequinoa
1947 lupinespringcereal springcerealcamelina grasscloverleyquinoa lupinespringcereal
1979 lupinespringcereal springcerealquinoa grasscloverleyspringcereal lupinecamelina
2141 lupinequinoa springcerealspringcereal grasscloverleycamelina lupinespringcereal
2745 lupinecamelina springcerealspringcereal grasscloverleyquinoa springcerealspringcereal
pexl07 looks like this (shortened for the example):
V1
1 quinoaquinoa
2 springcerealspringcereal
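I have tried many different approaches using for(), filter(), subset(), grepl.sub() and grepl(), but I cannot get it to work, probably because I do not understand how to index inside loops. Options without a loop are welcome too.
This works for a single column and a single pattern:
data02 <- filter(data01, !grepl(paste(pexl07[1, 1]), paste(data01[, 1])))
But how can I make it work automatically for all expressions in pexl07 and across all columns of data01?
I tried a few variations, but none of them returned what I want: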
for (j in ncol(data01)) {
  for (i in 1:nrow(pexl07)) {
    data02 <- filter(data01,
                     !grepl(paste(pexl07[j, ]), paste(data01[ ,i])))
  }
}
To be clear, I would like it to end up like this:
Pair_1 Pair_2 Pair_3 Pair_4
1330 lupinespringcereal grasscloverleycamelina quinoaspringcereal lupinequinoa
1910 lupinespringcereal springcerealcamelina grasscloverleyspringcereal lupinequinoa
1947 lupinespringcereal springcerealcamelina grasscloverleyquinoa lupinespringcereal
1979 lupinespringcereal springcerealquinoa grasscloverleyspringcereal lupinecamelina
With dput:
structure(list(Pair_1 = structure(c(6L, 6L, 6L, 6L), .Label = c("grasscloverleycamelina",
"grasscloverleyquinoa", "lupinecamelina", "lupinegrasscloverley",
"lupinequinoa", "lupinespringcereal"), class = "factor"), Pair_2 = structure(c(3L,
9L, 9L, 11L), .Label = c("camelinacamelina", "camelinagrasscloverley",
"grasscloverleycamelina", "grasscloverleyquinoa", "grasscloverleyspringcereal",
"quinoagrasscloverley", "quinoaquinoa", "quinoaspringcereal",
"springcerealcamelina", "springcerealgrasscloverley", "springcerealquinoa",
"springcerealspringcereal"), class = "factor"), Pair_3 = structure(c(11L,
7L, 6L, 7L), .Label = c("camelinacamelina", "camelinagrasscloverley",
"camelinaquinoa", "camelinaspringcereal", "grasscloverleycamelina",
"grasscloverleyquinoa", "grasscloverleyspringcereal", "quinoacamelina",
"quinoagrasscloverley", "quinoaquinoa", "quinoaspringcereal",
"springcerealcamelina", "springcerealquinoa", "springcerealspringcereal"
), class = "factor"), Pair_4 = structure(c(6L, 6L, 7L, 5L), .Label = c("camelinacamelina",
"camelinagrasscloverley", "grasscloverleycamelina", "grasscloverleyspringcereal",
"lupinecamelina", "lupinequinoa", "lupinespringcereal", "quinoagrasscloverley",
"quinoaquinoa", "quinoaspringcereal", "springcerealcamelina",
"springcerealquinoa", "springcerealspringcereal"), class = "factor")), row.names = c(1330L,
1910L, 1947L, 1979L), class = "data.frame")
dput pexl07:
structure(list(V1 = structure(1:2, .Label = c("quinoaquinoa",
"springcerealspringcereal"), class = "factor")), row.names = 1:2, class = "data.frame")
dput data01:
structure(list(Pair_1 = structure(c(6L, 6L, 6L, 6L, 3L, 6L), .Label = c("grasscloverleycamelina",
"grasscloverleyquinoa", "lupinecamelina", "lupinegrasscloverley",
"lupinequinoa", "lupinespringcereal"), class = "factor"), Pair_2 = structure(c(4L,
3L, 3L, 3L, 5L, 9L), .Label = c("camelinacamelina", "camelinagrasscloverley",
"grasscloverleycamelina", "grasscloverleyquinoa", "grasscloverleyspringcereal",
"quinoagrasscloverley", "quinoaquinoa", "quinoaspringcereal",
"springcerealcamelina", "springcerealgrasscloverley", "springcerealquinoa",
"springcerealspringcereal"), class = "factor"), Pair_3 = structure(c(14L,
14L, 11L, 10L, 10L, 7L), .Label = c("camelinacamelina", "camelinagrasscloverley",
"camelinaquinoa", "camelinaspringcereal", "grasscloverleycamelina",
"grasscloverleyquinoa", "grasscloverleyspringcereal", "quinoacamelina",
"quinoagrasscloverley", "quinoaquinoa", "quinoaspringcereal",
"springcerealcamelina", "springcerealquinoa", "springcerealspringcereal"
), class = "factor"), Pair_4 = structure(c(1L, 9L, 6L, 7L, 13L,
6L), .Label = c("camelinacamelina", "camelinagrasscloverley",
"grasscloverleycamelina", "grasscloverleyspringcereal", "lupinecamelina",
"lupinequinoa", "lupinespringcereal", "quinoagrasscloverley",
"quinoaquinoa", "quinoaspringcereal", "springcerealcamelina",
"springcerealquinoa", "springcerealspringcereal"), class = "factor")), row.names = c(453L,
1073L, 1330L, 1373L, 1698L, 1910L), class = "data.frame")
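Answer 0 (score: 0)
Updated my answer.
If I understand you correctly now, you want to remove the matching observations. In R, missing values are represented by NA.
Rather than storing what you want to remove in a data frame, store it in a vector, which is easier to use in a filter.
If you want to remove entire rows instead, let me know; that will need a slightly different solution.
I am using the tidyverse to achieve what you want. The code is below: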
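# Convert pexl07 to a character vector
pexl07 <- as.character(pexl07$V1)
library(dplyr)  # provides %>%, group_by() and mutate()
library(tidyr)  # provides gather() and spread()
data01 %>%
  # Long format: one row per (column name, value) pair
  gather(pair, cereal) %>%
  group_by(pair) %>%
  # Number the rows within each column so spread() can rebuild them
  mutate(index = row_number()) %>%
  # Replace every value listed in pexl07 with NA
  mutate(cereal = ifelse(cereal %in% pexl07, NA, cereal)) %>%
  # Back to wide format: one column per Pair_*
  spread(pair, cereal)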
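If whole rows should be dropped instead, as the expected output in the question suggests, the following is a minimal base R sketch rather than the answer's own method. It rebuilds the question's data01 and pexl07 by hand, and it assumes that a cell counts as a match only when it equals a pattern exactly; the names `patterns`, `hit` and `data02` are chosen here for illustration.

```r
# Rebuild the question's example data (values copied from its dput output).
data01 <- data.frame(
  Pair_1 = c("lupinespringcereal", "lupinespringcereal", "lupinespringcereal",
             "lupinespringcereal", "lupinecamelina", "lupinespringcereal"),
  Pair_2 = c("grasscloverleyquinoa", "grasscloverleycamelina",
             "grasscloverleycamelina", "grasscloverleycamelina",
             "grasscloverleyspringcereal", "springcerealcamelina"),
  Pair_3 = c("springcerealspringcereal", "springcerealspringcereal",
             "quinoaspringcereal", "quinoaquinoa", "quinoaquinoa",
             "grasscloverleyspringcereal"),
  Pair_4 = c("camelinacamelina", "quinoaquinoa", "lupinequinoa",
             "lupinespringcereal", "springcerealspringcereal", "lupinequinoa"),
  row.names = c(453L, 1073L, 1330L, 1373L, 1698L, 1910L),
  stringsAsFactors = TRUE
)
pexl07 <- data.frame(V1 = c("quinoaquinoa", "springcerealspringcereal"),
                     stringsAsFactors = TRUE)

# Work with a plain character vector of patterns.
patterns <- as.character(pexl07$V1)

# For each column, flag cells equal to any pattern, then OR the columns
# together to get one logical value per row.
hit <- Reduce(`|`, lapply(data01, function(col) as.character(col) %in% patterns))

# Keep only the rows where no column matched.
data02 <- data01[!hit, , drop = FALSE]
print(data02)
```

On this six-row sample only rows 1330 and 1910 remain. If substring matches should count as well, as with the grepl() call in the question, one option is to replace the `%in%` test with `grepl(paste(patterns, collapse = "|"), col)`.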
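You are not really filtering rows, but filtering values out by replacing them with blanks.
So I replaced the entries in the data frame with "" (blank) on the condition that they equal one of the expressions in pexl07.
This is done with gsub and regular expressions (regex); read ?gsub.
I am using sapply, which applies the replacement to each column. If you are using dplyr, . stands for the column, and mutate_all will mutate all columns.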
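The code for this earlier version is not shown here, so the following is only a sketch of what the description could look like, not a reconstruction of the original. It uses a small made-up subset of the question's values, and it anchors the regular expression with ^ and $ on the assumption that only whole-cell matches should be blanked; `pattern` and `blanked` are names chosen for illustration.

```r
# Small illustrative subset of the question's data (assumed, not the full data01).
data01 <- data.frame(
  Pair_3 = c("springcerealspringcereal", "quinoaspringcereal", "quinoaquinoa"),
  Pair_4 = c("camelinacamelina", "lupinequinoa", "springcerealspringcereal"),
  stringsAsFactors = FALSE
)
pexl07 <- c("quinoaquinoa", "springcerealspringcereal")

# One alternation regex; ^ and $ restrict it to cells that match entirely.
pattern <- paste0("^(", paste(pexl07, collapse = "|"), ")$")

# sapply() runs gsub() over every column and returns a character matrix,
# which is converted back to a data frame (original row names are not kept).
blanked <- as.data.frame(
  sapply(data01, function(col) gsub(pattern, "", col)),
  stringsAsFactors = FALSE
)
print(blanked)
```

With dplyr loaded, the equivalent call would be along the lines of `mutate_all(data01, ~ gsub(pattern, "", .))`, where `.` stands for the current column; note that recent dplyr versions mark mutate_all() as superseded in favour of across().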