Delete 2 consecutive rows that satisfy a condition

Time: 2018-09-24 14:29:04

Tags: r

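This is a chain of transactions, taken from a restaurant check that contains voided (cancelled) items.

I want R to check whether an item carries the flag "U" and, if so, delete both the "U" row and one matching row for the same item that is not flagged "U".

I have marked the rows to be deleted in yellow.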

chk_num dtl_name    Duration    Guest   void_type   Item_ttl
9707    Americano           45  1       18
9707    Americano           45  1       18
9707    Breakfast Tea       45  1       18
9707    Breakfast Tea       45  1   U   -18
9707    Café Latte          45  1       21
9707    Camomille Tea       45  1   U   -18
9707    Camomille Tea       45  1       18
9707    Earl Grey Tea       45  1   U   -18
9707    Earl Grey Tea       45  1       18
9707    Fresh Mint Tea      45  1   U   -18
9707    Fresh Mint Tea      45  1       18
9707    Green Tea           45  1       18
9707    Green Tea           45  1   U   -18
9707    Green Tea           45  1       18
9707    Lemon Tea           45  1       18
9707    Lemon Tea           45  1   U   -18
9707    Orange Juice        45  1       24
9707    Pepper Mint Tea     45  1       18
9707    Pepper Mint Tea     45  1   U   -18

[Image: screenshot of the data]

2 answers:

Answer 0 (score: 1)

An alternative solution using the data.table package:

# load the 'data.table'-package & convert 'DF' to a data.table
library(data.table)
setDT(DF)

# add a rownumber
DF[ , rn := .I][]

# create a subset with only the 'U'-rows and make 'Item_ttl' positive
DF_U <- DF[void_type == "U"][, Item_ttl := Item_ttl * -1][]

# create an index of rownumbers to be removed by:
# - extracting 'rn' from 'DF_U'
# - joining DF_U with DF
#   select only the first matching row in the join
#   and then extract 'rn'
# - concatenate these two vectors into one
ix <- c(DF_U$rn, DF[DF_U, on = .(chk_num,dtl_name,Duration,Guest,Item_ttl), mult = "first"]$rn)

Now you can get the desired end result with:

DF[!ix]

which gives:

   chk_num     dtl_name Duration Guest void_type Item_ttl rn
1:    9707    Americano       45     1      <NA>       18  1
2:    9707    Americano       45     1      <NA>       18  2
3:    9707   Café-Latte       45     1      <NA>       21  5
4:    9707    Green-Tea       45     1      <NA>       18 14
5:    9707 Orange-Juice       45     1      <NA>       24 17
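The join above uses mult = "first", so if the same item were voided twice, both "U" rows would be matched to the same regular row. The following is a minimal sketch of one way around that, not part of the original answer: it numbers the void rows and the regular rows separately within each item and pairs them one to one. The sample data is a made-up subset in the shape of the question's table (the second Green Tea void is added purely for illustration), and the helper columns amount, k and is_void are names introduced here.

```r
library(data.table)

# Made-up sample in the shape of the question's data; Green Tea is voided twice.
DF <- data.table(
  chk_num   = 9707L,
  dtl_name  = c("Americano", "Green Tea", "Green Tea",
                "Green Tea", "Green Tea", "Orange Juice"),
  Duration  = 45L,
  Guest     = 1L,
  void_type = c("", "", "U", "U", "", ""),
  Item_ttl  = c(18L, 18L, -18L, -18L, 18L, 24L)
)
DF[, rn := .I]

keys <- c("chk_num", "dtl_name", "Duration", "Guest")

# Helper columns (hypothetical names): absolute amount, void indicator,
# and a running counter k within each item/amount/void-status group.
DF[, amount  := abs(Item_ttl)]
DF[, is_void := !is.na(void_type) & void_type == "U"]
DF[, k := seq_len(.N), by = c(keys, "amount", "is_void")]

voids   <- DF[is_void == TRUE]
regular <- DF[is_void == FALSE]

# Inner join: the k-th void of an item is paired with its k-th regular row.
# Voids without a counterpart produce no match and are therefore kept.
pairs <- regular[voids, on = c(keys, "amount", "k"), nomatch = NULL]

# 'rn' is the regular row, 'i.rn' the row number of the matching void.
ix <- c(pairs$rn, pairs$i.rn)

result <- DF[!rn %in% ix]
result[, c("amount", "k", "is_void") := NULL]
print(result)
```

With this sample, only the Americano and Orange Juice rows remain, because both Green Tea voids cancel a distinct Green Tea row. Whether voids should be paired in row order like this depends on the business rules, which the question does not state.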

Answer 1 (score: 0)

I'm pretty sure there are better ways to do this.

Data:

library(magrittr)   # provides %>%
library(data.table) # provides fread() and setDF()

df1 <-
data.table::fread("chk_num dtl_name Duration Guest void_type Item_ttl
9707    Americano           45  1   NA  18
9707    Americano           45  1   NA  18
9707    Breakfast-Tea       45  1   NA  18
9707    Breakfast-Tea       45  1   U   -18
9707    Café-Latte          45  1   NA  21
9707    Camomille-Tea       45  1   U   -18
9707    Camomille-Tea       45  1   NA   18
9707    Earl-Grey-Tea       45  1   U   -18
9707    Earl-Grey-Tea       45  1   NA   18
9707    Fresh-Mint-Tea      45  1   U   -18
9707    Fresh-Mint-Tea      45  1   NA   18
9707    Green-Tea           45  1   NA  18
9707    Green-Tea           45  1   U   -18
9707    Green-Tea           45  1   NA  18
9707    Lemon-Tea           45  1   NA  18
9707    Lemon-Tea           45  1   U   -18
9707    Orange-Juice        45  1   NA  24
9707    Pepper-Mint-Tea     45  1   NA  18
9707    Pepper-Mint-Tea     45  1   U   -18") %>% setDF

Code:

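fun1 <- function(x) {
    # Repeat until no voided ("U") rows are left in this group
    while ("U" %in% x$void_type) {
        # first voided row
        flagU    <- min(which(x$void_type == "U"))
        # first row whose Item_ttl is the exact opposite of the voided amount
        delFlagU <- min(which(x$Item_ttl == -x$Item_ttl[flagU]))
        # drop both rows
        x <- x[-c(flagU, delFlagU), ]
    }
    return(x)
}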

df1 %>% dplyr::group_by(dtl_name, Duration, Guest) %>% dplyr::do(.,fun1(.))

Result:

# A tibble: 5 x 6
# Groups:   dtl_name, Duration, Guest [4]
#  chk_num dtl_name     Duration Guest void_type Item_ttl
#    <int> <chr>           <int> <int> <chr>        <int>
#1    9707 Americano          45     1 <NA>            18
#2    9707 Americano          45     1 <NA>            18
#3    9707 Café-Latte         45     1 <NA>            21
#4    9707 Green-Tea          45     1 <NA>            18
#5    9707 Orange-Juice       45     1 <NA>            24
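A quick way to sanity-check a result like this: every removed pair consists of an amount and its negative, so the net total of Item_ttl must be the same before and after. A small sketch, with the Item_ttl values typed in by hand from the data and the result shown above:

```r
# Item_ttl of the 19 input rows, in the order they appear in the data
before <- c(18, 18, 18, -18, 21, -18, 18, -18, 18, -18, 18,
            18, -18, 18, 18, -18, 24, 18, -18)

# Item_ttl of the 5 rows left in the result
after <- c(18, 18, 21, 18, 24)

# Each removed pair sums to zero, so the net total must not change
stopifnot(sum(before) == sum(after))

# Two rows are removed per void: the "U" row and its counterpart
stopifnot(length(before) - length(after) == 2 * sum(before < 0))

sum(after)  # 99
```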

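Please note:

If a row is flagged "U" but has no corresponding "pair", you will end up in an infinite while loop.

You may want to extend my answer to handle that case.

Since you have not said anything about the business logic, I know nothing about it. You can adjust the grouping in dplyr::group_by(dtl_name, Duration, Guest); I am especially unsure about Duration.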
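One possible extension, sketched here under assumptions rather than taken from the answer: walk over the "U" rows once, pair each with the first not-yet-used row of the opposite amount, and leave a void in place (with a warning) when no counterpart exists. The name fun1_safe and the choice to keep unmatched voids are assumptions; stopping with an error may suit the business rules better.

```r
# Hypothetical guarded variant of fun1; intended to be applied per group,
# in the same way as fun1.
fun1_safe <- function(x) {
  remove_row <- rep(FALSE, nrow(x))
  idx <- seq_len(nrow(x))

  # which() ignores NA, so rows with a missing void_type are skipped
  for (i in which(x$void_type == "U")) {
    # rows not yet removed, other than i, with the exact opposite amount
    partner <- which(!remove_row & idx != i &
                       x$Item_ttl == -x$Item_ttl[i])
    if (length(partner) == 0) {
      warning("no counterpart found for voided item: ", x$dtl_name[i])
      next  # the unmatched void stays in the data
    }
    remove_row[c(i, partner[1])] <- TRUE
  }
  x[!remove_row, ]
}

# Made-up example: one regular row but two voids for the same item
demo <- data.frame(
  dtl_name  = c("Green Tea", "Green Tea", "Green Tea"),
  void_type = c(NA, "U", "U"),
  Item_ttl  = c(18, -18, -18)
)
res <- suppressWarnings(fun1_safe(demo))
print(res)
```

Here the first void cancels the regular row, and the second void is returned unchanged because nothing is left to pair it with. Because the loop visits each "U" row exactly once, it always terminates.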


If you are more of a data.table person:

data.table::setDT(df1)[, fun1(.SD), by = .(dtl_name, Duration, Guest)]