Reproducible dataset
data1 <- data.frame(ID = c(1,2), Description = c("Chiquita","Chiquita mazamorra"), Max = c(200,125))
data2 <- data.frame(ID = c(1,2,3,4,5,6,7), Description = c("Chiquita mini", "Chiquita Oriville","Chiquita 24h","Manzano Chiquita 5j...","Chiquita mazamorra 1,2h..","Chiquita mazamorra Buro","Chiquita AM 2F"), Actual = c(24,110,80,90,134,123,210))
I have a dataset, data1, that looks like this:
Id Description Max
1 Chiquita 200
2 Chiquita mazamorra 125
I have another dataset, data2, that looks like this:
Id Description Actual
1 Chiquita mini 24
2 Chiquita Oriville 110
3 Chiquita 24h 80
4 Manzano Chiquita 5j... 90
5 Chiquita mazamorra 1,2h... 134
6 Chiquita mazamorra Buro 123
7 Chiquita AM 2F 210
8 Chiquita..... 124
9 Chiquita(P) 213
10 Chiquita, mazamorra, S 188
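The if statement should check whether data2$Description contains the string "Chiquita mazamorra" from data1$Description. If it does, check whether data2$Actual > data1$Max. If so, Result is Good, otherwise Small. Note that other characters may follow "Chiquita mazamorra": for example, "Chiquita mazamorra 1,2h.." is fine, but "Chiquita mazamorra Buro" is not.
Likewise, a second ifelse should check whether data2$Description contains "Chiquita". If it does, check whether data2$Actual > data1$Max. If so, Result is Good, otherwise Small. Other characters may follow "Chiquita": for example, "Chiquita 24h" or "Chiquita AM 2F" are fine, but "Chiquita mini" and "Chiquita Oriville" are not.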
This is the final desired output (data2):
Id Description Actual Result
1 Chiquita mini 24 NA
2 Chiquita Oriville 110 NA
3 Chiquita 24h 80 Small
4 Manzano Chiquita 5j... 90 NA
5 Chiquita mazamorra 1,2h... 134 Good
6 Chiquita mazamorra Buro 123 NA
7 Chiquita AM 2F 210 Good
8 Chiquita..... 124 Small
9 Chiquita(P) 213 NA
10 Chiquita, mazamorra, S 188 Good
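I know this can be done with a combination of grepl and ifelse, but I'm not confident about how. Maybe there is a better way that I don't know of; I've gotten very confused. Any help would be appreciated.
Answer 0 (score: 0)
Here is an outline of a solution: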
data1 <- read.csv(text=
"Id,Description,Max
1,Chiquita,200
2,Chiquita mazamorra,125")
data2 <- read.csv(text=
"Id,Description,Actual
1,Chiquita mini,24
2,Chiquita Oriville,110
3,Chiquita 24h,80
4,Manzano Chiquita 5j,90
5,Chiquita mazamorra 12h,134
6,Chiquita mazamorra Buro,123
7,Chiquita AM 2F,210")
# start by trimming the description to the first few words
# that don't start with a number
data2$Description_trimmed <- gsub('\\s+\\d.*$','',data2$Description)
# initialize the output field
data2$Results <- NA
# loop while there are missing values in data2$Results
while(any(is.na(data2$Results))){
  # identify records that still need to be calculated
  indx <- is.na(data2$Results)
  # calculate the result based on the current trimmed description:
  # 'Good' when Actual exceeds the matching Max, 'Small' otherwise
  data2[indx,'Results'] <- ifelse(
    data2[indx,'Actual'] >
      data1[match(data2[indx,'Description_trimmed'],
                  data1[ ,'Description']),
            "Max"],
    'Good',
    'Small')
  # trim the last word from Description_trimmed
  data2$Description_trimmed <- gsub('(^| +)[^ ]*$','',data2$Description_trimmed)
  # stop if the remaining trimmed descriptions are empty
  if(all(grepl('^\\s*$',data2$Description_trimmed)))
    break
}
data2
#>   Id             Description Actual Description_trimmed Results
#> 1  1           Chiquita mini     24                       Small
#> 2  2       Chiquita Oriville    110                       Small
#> 3  3            Chiquita 24h     80                       Small
#> 4  4     Manzano Chiquita 5j     90                        <NA>
#> 5  5  Chiquita mazamorra 12h    134                        Good
#> 6  6 Chiquita mazamorra Buro    123                       Small
#> 7  7          Chiquita AM 2F    210                        Good
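To see what the two gsub calls in the loop do on their own, here is a small sketch. The sample vector x is made up for illustration (it mirrors a few rows of data2), and step1 and step2 are hypothetical names that do not appear in the code above.

```r
# Hypothetical sample descriptions, chosen to mirror rows of data2
x <- c("Chiquita 24h", "Chiquita mazamorra 12h", "Chiquita AM 2F", "Chiquita mini")

# Step 1: drop everything from the first "whitespace then digit" onward
step1 <- gsub("\\s+\\d.*$", "", x)
print(step1)
# "Chiquita" "Chiquita mazamorra" "Chiquita AM" "Chiquita mini"

# Step 2: drop the last space-separated word (one pass of the loop's trimming)
step2 <- gsub("(^| +)[^ ]*$", "", step1)
print(step2)
# "" "Chiquita" "Chiquita" "Chiquita"
```

Because step 2 keeps removing trailing words until something matches data1$Description, "Chiquita mini" eventually falls back to "Chiquita", which is why that row gets a result here rather than NA.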
(BTW, this solution computes is.na(data2$Results) twice per loop iteration when it only needs to be computed once; I was aiming for readability rather than efficiency here.)
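If the NA rows in the desired output matter, a grepl-based variant might look like the sketch below. The question never spells out which trailing text is allowed, so the suffix rule here is an assumption inferred from the ten expected rows: after the key, the rest must be empty, only dots, or whitespace followed by a token that starts with a digit or is entirely upper-case. Dropping commas before matching is also an assumption, as are the names suffix_ok and clean. The keys are pasted straight into the pattern, which assumes they contain no regex metacharacters.

```r
# Lookup table and observations, following the tables in the question
data1 <- data.frame(
  Id = 1:2,
  Description = c("Chiquita", "Chiquita mazamorra"),
  Max = c(200, 125),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  Id = 1:10,
  Description = c("Chiquita mini", "Chiquita Oriville", "Chiquita 24h",
                  "Manzano Chiquita 5j...", "Chiquita mazamorra 1,2h...",
                  "Chiquita mazamorra Buro", "Chiquita AM 2F",
                  "Chiquita.....", "Chiquita(P)", "Chiquita, mazamorra, S"),
  Actual = c(24, 110, 80, 90, 134, 123, 210, 124, 213, 188),
  stringsAsFactors = FALSE
)

# ASSUMED rule (inferred from the expected output, not stated in the question):
# the text after the key is empty, only dots, or whitespace followed by a token
# that starts with a digit or consists only of upper-case letters and digits.
suffix_ok <- "(\\.*|\\s+([0-9]\\S*|[A-Z0-9]+)(\\s.*)?)$"

# ASSUMPTION: commas are ignored, so "Chiquita, mazamorra, S" can match the longer key
clean <- gsub(",", "", data2$Description, fixed = TRUE)

data2$Result <- NA_character_
# Try longer keys first so "Chiquita mazamorra" takes precedence over "Chiquita"
for (i in order(nchar(data1$Description), decreasing = TRUE)) {
  pattern <- paste0("^", data1$Description[i], suffix_ok)
  # Only fill rows that no earlier (longer) key has already claimed
  hit <- is.na(data2$Result) & grepl(pattern, clean, perl = TRUE)
  data2$Result[hit] <- ifelse(data2$Actual[hit] > data1$Max[i], "Good", "Small")
}
data2
```

Under these assumptions the Result column matches the desired output for all ten rows. The pattern is fitted to those examples, so it would need revisiting if real descriptions follow a different convention.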