Here is a sample data frame in R -
date item_id price
2010-09-15 0034 4546
2010-09-15 ABXC 4325
2010-09-15 12AB 3545
2010-09-15 ZF9C 4354
2010-09-15 Z923 7854
2010-09-15 923F 780
Desired output -
date item_id price
2010-09-15 ABXC 4325
2010-09-15 12AB 3545
2010-09-15 ZF9C 4354
2010-09-15 Z923 7854
2010-09-15 923F 780
What I have tried so far -
outlier_seq<-c('0','1','2','3','4','5','6','7','8','9')
df1<-sample_df[!grepl(paste(outlier_seq, collapse = "|"), sample_df$item_id),]
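But this removes every record whose item_id contains a digit. I only want to filter out the records whose item_id consists entirely of digits. Any help with this?
Thanks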
Answer 0: (score: 3)
Assuming you start with:
mydf <- structure(list(date = c("2010-09-15", "2010-09-15", "2010-09-15",
"2010-09-15", "2010-09-15"), item_id = c("0034", "ABXC", "12AB",
"ZF9C", "ZF9C23"), price = c(4546L, 4325L, 3545L, 4354L, 7854L
)), .Names = c("date", "item_id", "price"), row.names = c(NA,
5L), class = "data.frame")
You should be able to do:
mydf[!grepl("^[0-9]", mydf$item_id), ]
## date item_id price
## 2 2010-09-15 ABXC 4325
## 4 2010-09-15 ZF9C 4354
## 5 2010-09-15 ZF9C23 7854
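The pattern above drops every id that starts with a digit, so 12AB goes too. If the goal is the one stated in the question (drop only the ids made up entirely of digits), a base R sketch could anchor the pattern at both ends instead. The data frame below is rebuilt from the question's six rows so the block runs on its own; `any_digit`, `all_digits` and `kept` are names chosen for illustration.

```r
# Six-row data frame from the question, rebuilt so this sketch is self-contained
sample_df <- data.frame(
  date = rep("2010-09-15", 6),
  item_id = c("0034", "ABXC", "12AB", "ZF9C", "Z923", "923F"),
  price = c(4546L, 4325L, 3545L, 4354L, 7854L, 780L),
  stringsAsFactors = FALSE
)

# Unanchored digit pattern (equivalent to "0|1|...|9"):
# TRUE for any id that contains a digit anywhere
any_digit <- grepl("[0-9]", sample_df$item_id)

# Anchored at both ends: TRUE only when the whole id is digits
all_digits <- grepl("^[0-9]+$", sample_df$item_id)

# Keep everything that is not all digits
kept <- sample_df[!all_digits, ]
print(kept)
```

Comparing `any_digit` with `all_digits` shows why the attempt in the question removed too much: without `^` and `$`, a single digit anywhere in the id is enough for a match.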
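Answer 1: (score: 0)
Alternatively, we can use the tidyverse: str_detect matches a pattern of one or more non-digit characters ([^0-9]+) at the start (^) of the string and returns a logical vector, which filter uses to select the rows.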
library(dplyr)
library(stringr)
mydf %>%
filter(str_detect(item_id, "^[^0-9]+"))
# date item_id price
#1 2010-09-15 ABXC 4325
#2 2010-09-15 ZF9C 4354
#3 2010-09-15 ZF9C23 7854
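For the updated question in the OP's post, we can look for a pattern of one or more digits ([0-9]+) running from the start (^) to the end ($) of the string, negate (!) the logical vector so that TRUE/FALSE becomes FALSE/TRUE, and then filter.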
mydf %>%
filter(!str_detect(item_id, "^[0-9]+$"))
# date item_id price
#1 2010-09-15 ABXC 4325
#2 2010-09-15 12AB 3545
#3 2010-09-15 ZF9C 4354
#4 2010-09-15 ZF9C23 7854
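To make the effect of the anchors concrete, here is a small base R sketch that applies the three patterns discussed in this thread to the same ids. The vector `ids`, the pattern labels and `res` are illustrative names, not part of the original answers.

```r
# Ids taken from the examples in this thread
ids <- c("0034", "ABXC", "12AB", "ZF9C", "ZF9C23", "07R2")

# The three patterns used so far, labelled for readability
patterns <- c(
  starts_with_digit    = "^[0-9]",    # first character is a digit
  starts_with_nondigit = "^[^0-9]+",  # first character is not a digit
  all_digits           = "^[0-9]+$"   # every character is a digit
)

# One logical column per pattern, one row per id
res <- sapply(patterns, grepl, x = ids)
rownames(res) <- ids
print(res)
```

Only the `all_digits` column singles out 0034 while leaving mixed ids such as 12AB and 07R2 unmatched, which is why it is the one to negate inside `filter`.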
Based on the OP's concern that this filters out "07R2", we can test it by adding another row with that value
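mydf %>%
  bind_rows(data.frame(date = "2010-09-15", item_id = "07R2", price = 7934L,
                       stringsAsFactors = FALSE)) %>%
  filter(!str_detect(item_id, "^[0-9]+$"))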
# date item_id price
#1 2010-09-15 ABXC 4325
#2 2010-09-15 12AB 3545
#3 2010-09-15 ZF9C 4354
#4 2010-09-15 ZF9C23 7854
#5 2010-09-15 07R2 7934
Based on the OP's new dataset
mydf %>%
filter(!str_detect(item_id, "^[0-9]+$"))
# date item_id price
#1 2010-09-15 ABXC 4325
#2 2010-09-15 12AB 3545
#3 2010-09-15 ZF9C 4354
#4 2010-09-15 Z923 7854
#5 2010-09-15 923F 780
This works even if the column is a factor
mydf %>%
filter(!str_detect(factor(item_id), "^[0-9]+$"))
# date item_id price
#1 2010-09-15 ABXC 4325
#2 2010-09-15 12AB 3545
#3 2010-09-15 ZF9C 4354
#4 2010-09-15 Z923 7854
#5 2010-09-15 923F 780
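The same holds in base R, since `grepl` accepts anything that `as.character` can coerce, factors included. The sketch below assumes the six-row data from the last update with item_id stored as a factor; `mydf_f`, `keep` and `out` are illustrative names. One thing to watch with factors is that the removed value stays on as an unused level unless it is dropped explicitly.

```r
# Six-row data from the last update, with item_id stored as a factor
mydf_f <- data.frame(
  date = rep("2010-09-15", 6),
  item_id = factor(c("0034", "ABXC", "12AB", "ZF9C", "Z923", "923F")),
  price = c(4546L, 4325L, 3545L, 4354L, 7854L, 780L)
)

# grepl coerces the factor to character before matching
keep <- !grepl("^[0-9]+$", mydf_f$item_id)

# Subset, then drop the now-unused "0034" level
out <- droplevels(mydf_f[keep, ])
print(out)
print(levels(out$item_id))
```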
#data from last update
mydf <- structure(list(date = c("2010-09-15", "2010-09-15", "2010-09-15",
"2010-09-15", "2010-09-15", "2010-09-15"), item_id = c("0034",
"ABXC", "12AB", "ZF9C", "Z923", "923F"), price = c(4546L, 4325L,
3545L, 4354L, 7854L, 780L)), .Names = c("date", "item_id", "price"
), class = "data.frame", row.names = c(NA, -6L))