Filter a data frame based on column (item_id) values that consist entirely of digits?

Time: 2017-04-01 04:10:17

Tags: r dataframe pattern-matching

Below is a sample data frame in R -

date                  item_id           price
2010-09-15            0034              4546
2010-09-15            ABXC              4325
2010-09-15            12AB              3545
2010-09-15            ZF9C              4354
2010-09-15            Z923              7854
2010-09-15            923F              780

Desired output -

date                  item_id           price
2010-09-15            ABXC              4325
2010-09-15            12AB              3545
2010-09-15            ZF9C              4354
2010-09-15            Z923              7854
2010-09-15            923F              780

What I have tried so far -

outlier_seq<-c('0','1','2','3','4','5','6','7','8','9')
df1<-sample_df[!grepl(paste(outlier_seq, collapse = "|"), sample_df$item_id),]

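But this removes every record whose item_id contains a digit anywhere. I only want to filter out the records whose item_id consists entirely of digits. Any help with this?

Thanks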

2 Answers:

Answer 0: (score: 3)

Assuming you're starting with:

mydf <- structure(list(date = c("2010-09-15", "2010-09-15", "2010-09-15", 
    "2010-09-15", "2010-09-15"), item_id = c("0034", "ABXC", "12AB", 
    "ZF9C", "ZF9C23"), price = c(4546L, 4325L, 3545L, 4354L, 7854L
    )), .Names = c("date", "item_id", "price"), row.names = c(NA, 
    5L), class = "data.frame")

You should be able to do:

mydf[!grepl("^[0-9]", mydf$item_id), ]
##         date item_id price
## 2 2010-09-15    ABXC  4325
## 4 2010-09-15    ZF9C  4354
## 5 2010-09-15  ZF9C23  7854
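
Note that the pattern above is anchored only at the start, so it drops every item_id that begins with a digit, including "12AB". If the goal is the desired output shown in the question (drop only the IDs made up entirely of digits), a minimal base R sketch is to anchor the pattern at both ends. The name `kept` is just an illustrative variable, and the data frame is rebuilt here so the snippet runs on its own:

```r
# Same five rows as the mydf used in this answer
mydf <- data.frame(
  date    = rep("2010-09-15", 5),
  item_id = c("0034", "ABXC", "12AB", "ZF9C", "ZF9C23"),
  price   = c(4546L, 4325L, 3545L, 4354L, 7854L),
  stringsAsFactors = FALSE
)

# ^ and $ force the digit run to span the whole string,
# so only all-digit IDs such as "0034" match and get dropped
kept <- mydf[!grepl("^[0-9]+$", mydf$item_id), ]
kept$item_id
# "ABXC" "12AB" "ZF9C" "ZF9C23"
```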

Answer 1: (score: 0)

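Or we can use tidyverse: with str_detect, match a pattern that starts (^) with one or more non-digit characters ([^0-9]+), which returns a logical vector that filter uses to keep the rows

library(dplyr)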
library(stringr)
mydf %>% 
    filter(str_detect(item_id, "^[^0-9]+"))
#        date item_id price
#1 2010-09-15    ABXC  4325
#2 2010-09-15    ZF9C  4354
#3 2010-09-15  ZF9C23  7854

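Update

For the updated question in the OP's post, we can look for a pattern that has one or more digits ([0-9]+) from the start (^) to the end ($) of the string, then negate (!) the logical vector so that TRUE/FALSE becomes FALSE/TRUE before passing it to filter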

mydf %>%
       filter(!str_detect(item_id, "^[0-9]+$"))
#        date item_id price
#1 2010-09-15    ABXC  4325
#2 2010-09-15    12AB  3545
#3 2010-09-15    ZF9C  4354
#4 2010-09-15  ZF9C23  7854
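
To see why the anchors matter, here is a small sketch that compares three patterns on the item_id values from the question. It uses base R's grepl so it runs without extra packages; the names `ids`, `any_digit`, `lead_digit` and `all_digits` are illustrative only:

```r
ids <- c("0034", "ABXC", "12AB", "ZF9C", "Z923", "923F")

# No anchors: TRUE whenever a digit appears anywhere in the string
# (this is what the alternation "0|1|...|9" in the question does)
any_digit <- grepl("[0-9]", ids)

# Start anchor only: TRUE when the first character is a digit
lead_digit <- grepl("^[0-9]", ids)

# Both anchors: TRUE only when every character is a digit
all_digits <- grepl("^[0-9]+$", ids)

data.frame(ids, any_digit, lead_digit, all_digits)
```

Only the last pattern singles out "0034" while leaving mixed IDs such as "12AB" and "923F" alone.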

UPDATE2

Based on the OP's concern that this filters out "07R2", we tested it by adding another row with that value

mydf %>% 
     filter(!str_detect(item_id, "^[0-9]+$"))
 #        date item_id price
 #1 2010-09-15    ABXC  4325
 #2 2010-09-15    12AB  3545
 #3 2010-09-15    ZF9C  4354
 #4 2010-09-15  ZF9C23  7854
 #5 2010-09-15    07R2  7934

UPDATE3

Based on the OP's new dataset

mydf %>% 
     filter(!str_detect(item_id, "^[0-9]+$"))
#        date item_id price
#1 2010-09-15    ABXC  4325
#2 2010-09-15    12AB  3545
#3 2010-09-15    ZF9C  4354
#4 2010-09-15    Z923  7854
#5 2010-09-15    923F   780

It also works even if the column is a factor

mydf %>%
      filter(!str_detect(factor(item_id), "^[0-9]+$"))
#        date item_id price
#1 2010-09-15    ABXC  4325
#2 2010-09-15    12AB  3545
#3 2010-09-15    ZF9C  4354
#4 2010-09-15    Z923  7854
#5 2010-09-15    923F   780
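
One thing to keep in mind with a factor column: subsetting rows does not remove the levels that are no longer used, so "0034" would still be listed among the levels after filtering. A hedged base R sketch of cleaning that up with droplevels is below; `kept` is an illustrative name, and the explicit as.character() call is just a defensive choice to make the conversion visible:

```r
# Same six rows as the data from the last update, with item_id as a factor
mydf <- data.frame(
  date    = rep("2010-09-15", 6),
  item_id = factor(c("0034", "ABXC", "12AB", "ZF9C", "Z923", "923F")),
  price   = c(4546L, 4325L, 3545L, 4354L, 7854L, 780L)
)

# Convert to character explicitly before matching, then drop all-digit IDs
kept <- mydf[!grepl("^[0-9]+$", as.character(mydf$item_id)), ]

# The row is gone, but the unused level "0034" is still attached
"0034" %in% levels(kept$item_id)

# droplevels() removes factor levels that no longer occur in the data
kept <- droplevels(kept)
"0034" %in% levels(kept$item_id)
```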

Data

#data from last update
mydf <- structure(list(date = c("2010-09-15", "2010-09-15", "2010-09-15", 
"2010-09-15", "2010-09-15", "2010-09-15"), item_id = c("0034", 
"ABXC", "12AB", "ZF9C", "Z923", "923F"), price = c(4546L, 4325L, 
3545L, 4354L, 7854L, 780L)), .Names = c("date", "item_id", "price"
 ), class = "data.frame", row.names = c(NA, -6L))