Nested if else for string search

Time: 2015-02-23 00:18:29

Tags: r if-statement dplyr grepl

Reproducible dataset

   data1 <- data.frame(ID = c(1,2), Description = c("Chiquita","Chiquita mazamorra"), Max = c(200,125))
   data2 <- data.frame(ID = c(1,2,3,4,5,6,7), Description = c("Chiquita mini", "Chiquita Oriville","Chiquita 24h","Manzano Chiquita 5j...","Chiquita mazamorra 1,2h..","Chiquita mazamorra Buro","Chiquita AM 2F"), Actual = c(24,110,80,90,134,123,210))

I have a dataset, data1, that looks like this:

  Id     Description            Max
  1      Chiquita               200
  2      Chiquita mazamorra     125

I have another dataset, data2, that looks like this:

  Id     Description                   Actual
  1      Chiquita mini                 24
  2      Chiquita Oriville             110
  3      Chiquita 24h                  80
  4      Manzano Chiquita 5j...        90
  5      Chiquita mazamorra 1,2h...    134
  6      Chiquita mazamorra Buro       123
  7      Chiquita AM 2F                210
  8      Chiquita.....                 124
  9      Chiquita(P)                   213
  10     Chiquita, mazamorra, S        188                   

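The if statement should check whether data2$Description contains the string from data1$Description, Chiquita mazamorra; if it does, check whether data2$Actual > data1$Max. If so, Result is Good, otherwise Small. Note that other characters may follow Chiquita mazamorra: for example, Chiquita mazamorra 1,2h.. is fine, but Chiquita mazamorra Buro is not.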

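Similarly, a second ifelse should check whether data2$Description contains Chiquita; if it does, check whether data2$Actual > data1$Max. If so, Result is Good, otherwise Small. Other characters may follow Chiquita: for example, Chiquita 24h and Chiquita AM 2F are fine, but Chiquita mini and Chiquita Oriville are not.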
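To illustrate why a bare substring test is not enough for the two rules above, here is a minimal sketch comparing a plain `grepl` with a start-anchored one. The vector `desc` is a hypothetical sample taken from the descriptions in data2.

```r
# Hypothetical sample of descriptions from data2
desc <- c("Chiquita mini", "Chiquita 24h",
          "Manzano Chiquita 5j...", "Chiquita mazamorra Buro")

# A plain substring test accepts every row, including "Manzano Chiquita 5j..."
plain <- grepl("Chiquita", desc, fixed = TRUE)

# Anchoring at the start drops "Manzano Chiquita 5j...", but still accepts
# "Chiquita mini" and "Chiquita mazamorra Buro"
anchored <- grepl("^Chiquita", desc)

print(data.frame(desc, plain, anchored))
```

Anchoring alone therefore cannot separate Chiquita 24h from Chiquita mini; some extra condition on the text that follows the key is still needed.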

This is the final desired output (data2):

  Id     Description                   Actual      Result
  1      Chiquita mini                 24          NA
  2      Chiquita Oriville             110         NA
  3      Chiquita 24h                  80          Small
  4      Manzano Chiquita 5j...        90          NA
  5      Chiquita mazamorra 1,2h...    134         Good
  6      Chiquita mazamorra Buro       123         NA
  7      Chiquita AM 2F                210         Good
  8      Chiquita.....                 124         Small
  9      Chiquita(P)                   213         NA
  10     Chiquita, mazamorra, S        188         Good

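I know this can be done with a combination of grepl and ifelse statements, but I am not very confident about it. Maybe there is a better way to do this that I am not aware of; I am getting very confused. Any help would be appreciated.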
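One way the grepl and ifelse combination could look is sketched below. The question never states exactly which trailing text is acceptable, so the regular expression `suffix_ok` encodes an assumed rule inferred only from the ten example rows: after the key, the description may end, contain only spaces, commas, or dots, or continue with a token that starts with a digit or is an all-uppercase word. The names `suffix_ok`, `is_maza`, `is_chiq`, `max_maza`, and `max_chiq` are hypothetical.

```r
data1 <- data.frame(
  Id = 1:2,
  Description = c("Chiquita", "Chiquita mazamorra"),
  Max = c(200, 125),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  Id = 1:10,
  Description = c("Chiquita mini", "Chiquita Oriville", "Chiquita 24h",
                  "Manzano Chiquita 5j...", "Chiquita mazamorra 1,2h...",
                  "Chiquita mazamorra Buro", "Chiquita AM 2F",
                  "Chiquita.....", "Chiquita(P)", "Chiquita, mazamorra, S"),
  Actual = c(24, 110, 80, 90, 134, 123, 210, 124, 213, 188),
  stringsAsFactors = FALSE
)

# ASSUMPTION: what may follow a key. Either separators and then a token that
# starts with a digit or is an all-uppercase word, or separators only.
suffix_ok <- "([ ,.]+([0-9]|[A-Z]+\\b).*|[ ,.]*)$"

# Test the longer key first so it takes precedence over plain "Chiquita"
is_maza <- grepl(paste0("^Chiquita[ ,]+mazamorra", suffix_ok),
                 data2$Description, perl = TRUE)
is_chiq <- !is_maza & grepl(paste0("^Chiquita", suffix_ok),
                            data2$Description, perl = TRUE)

max_maza <- data1$Max[data1$Description == "Chiquita mazamorra"]
max_chiq <- data1$Max[data1$Description == "Chiquita"]

# Nested ifelse: pick the threshold by key, leave unmatched rows as NA
data2$Result <- ifelse(
  is_maza, ifelse(data2$Actual > max_maza, "Good", "Small"),
  ifelse(is_chiq, ifelse(data2$Actual > max_chiq, "Good", "Small"),
         NA_character_)
)

print(data2)
```

Under that assumed rule the sketch reproduces the Result column of the desired output for these ten rows. Real data with other trailing patterns would probably need a different `suffix_ok`.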

1 Answer:

Answer 0 (score: 0)

Here is an outline of a solution

data1 <- read.csv(text=
"Id,Description,Max
1,Chiquita,200
2,Chiquita mazamorra,125")

data2 <- read.csv(text=
"Id,Description,Actual
1,Chiquita mini,24
2,Chiquita Oriville,110
3,Chiquita 24h,80
4,Manzano Chiquita 5j,90
5,Chiquita mazamorra 12h,134
6,Chiquita mazamorra Buro,123
7,Chiquita AM 2F,210")


# start by trimming the description to the first few words 
# that don't start with a number
data2$Description_trimmed <- gsub('\\s+\\d.*$','',data2$Description)

# initialize the output field
data2$Results <- NA

# loop while there are missing values in data2$Results
while(any(is.na(data2$Results))){

    # identify records that still need to be calculated
    indx <- is.na(data2$Results)

    # calculate the result based on the current trimmed description
    # 'Good' when Actual exceeds Max, as the question specifies
    data2[indx,'Results']  <-  ifelse(
                data2[indx,'Actual']  >
                    data1[match(data2[indx,'Description_trimmed'],
                                data1[    ,'Description']),
                          "Max"],
                'Good',
                'Small')

    # trim the last word from Description_trimmed
    data2$Description_trimmed <- gsub('(^| +)[^ ]*$','',data2$Description_trimmed)

    # stop if the remaining trimmed descriptions are empty
    if(all(grepl('^\\s*$',data2$Description_trimmed)))
        break
}

data2
#>   Id             Description Actual Description_trimmed Results
#> 1  1           Chiquita mini     24                       Small
#> 2  2       Chiquita Oriville    110                       Small
#> 3  3            Chiquita 24h     80                       Small
#> 4  4     Manzano Chiquita 5j     90                        <NA>
#> 5  5  Chiquita mazamorra 12h    134                        Good
#> 6  6 Chiquita mazamorra Buro    123                       Small
#> 7  7          Chiquita AM 2F    210                        Good

(BTW, this solution computes is.na(data2$Results) twice per loop iteration when it only needs to be computed once; I was going for readability rather than efficiency here.)
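As a rough sketch of the point about computing the NA mask only once, the loop could be restructured with `repeat` so that a single `is.na()` call per iteration drives both the stop test and the update. The data is a hypothetical four-row subset of the example, the vectors `trimmed` and `results` stand in for the data frame columns, and the comparison uses Actual > Max as the question specifies. The stop test also differs slightly from the outline: it looks only at the rows that are still pending.

```r
data1 <- data.frame(
  Description = c("Chiquita", "Chiquita mazamorra"),
  Max = c(200, 125),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  Description = c("Chiquita mini", "Chiquita 24h",
                  "Manzano Chiquita 5j", "Chiquita mazamorra 12h"),
  Actual = c(24, 80, 90, 134),
  stringsAsFactors = FALSE
)

# Drop everything from the first word that starts with a digit
trimmed <- gsub("\\s+\\d.*$", "", data2$Description)
results <- rep(NA_character_, nrow(data2))

repeat {
  indx <- is.na(results)  # the only is.na() call in each iteration
  # Stop when nothing is pending or the pending descriptions are used up
  if (!any(indx) || all(!nzchar(trimmed[indx]))) break

  # Look up Max for an exact match on the current trimmed description
  hit <- match(trimmed[indx], data1$Description)
  results[indx] <- ifelse(data2$Actual[indx] > data1$Max[hit], "Good", "Small")

  # Remove the last word, then try again with the shorter description
  trimmed <- gsub("(^| +)[^ ]*$", "", trimmed)
}

print(cbind(data2, Results = results))
```

Like the outline, this falls back from longer to shorter descriptions, so Chiquita mini ends up matched against Chiquita and receives a result rather than the NA shown in the question's desired output.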