Question:
Consider the data frame df:
df <- structure(list(id = 1:4, var1 = c("blissard", "Blizzard", "storm of snow",
"DUST DEVIL/BLIZZARD")), .Names = c("id", "var1"), class = "data.frame", row.names = c(NA,
-4L))
> df
id var1
1 "blissard"
2 "Blizzard"
3 "storm of snow"
4 "DUST DEVIL/BLIZZARD"
> class(df$var1)
[1] "character"
I want to tidy it up, so I tried recoding var1, which has four distinct entries, into a friendlier, analysis-ready var1_recoded, like this:
df$var1_recoded[grepl("[Bb][Ll][Ii]", df$var1)] <- "blizzard"
df$var1_recoded[grepl("[Ss][Tt][Oo]", df$var1)] <- "storm"
id var1 var1_recoded
1 "blissard" "blizzard"
2 "Blizzard" "blizzard"
3 "storm of snow" "storm"
4 "DUST DEVIL/BLIZZARD" "blizzard"
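Question:
How can I write a function that automates the process described by the two statements above? In other words: how do I generalize this to (say) 1000 replacements?
I would pass the function a vector of target values (e.g. c("storm", "blizzard")) and then use apply to match and replace the observations that meet the condition.
I found a valuable contribution here: Replace multiple arguments with gsub
but I could not translate the above into R programmatically. In particular, I could not build a condition that lets grep recognize the first three letters of the word to match.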
Answer 0 (score: 1)
Here is one possible approach:
dat <- read.csv(text="id, var1
1, blissard
2, Blizzard
3, storm of snow
4, hurricane
5, DUST DEVIL/BLIZZARD", header=T, stringsAsFactors = FALSE, strip.white=T)
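x <- c("storm", "blizzard")  # target categories
if (!require("pacman")) install.packages("pacman")
pacman::p_load(stringdist, stringi)
dat[["var1_recoded"]] <- NA
tol <- .6  # largest Jaccard distance still accepted as a match
for (i in seq_len(nrow(dat))) {
  # split the entry into words and compare every word with every target
  potentials <- unlist(stri_extract_all_words(dat[["var1"]][i]))
  y <- stringdistmatrix(tolower(potentials), tolower(x), method = "jaccard")
  if (min(y) > tol) {
    # no word is close enough to any target: keep the original value
    dat[["var1_recoded"]][i] <- dat[["var1"]][i]
  } else {
    # pick the target (column of y) with the smallest distance; first hit wins on ties
    dat[["var1_recoded"]][i] <- x[which(y == min(y), arr.ind = TRUE)[1, "col"]]
  }
}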
## id var1 var1_recoded
## 1 1 blissard blizzard
## 2 2 Blizzard blizzard
## 3 3 storm of snow storm
## 4 4 hurricane hurricane
## 5 5 DUST DEVIL/BLIZZARD blizzard
Edit: incorporated @mra68's data into the solution.
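To see why a tolerance of .6 separates the rows in the answer above, the distance can be reproduced by hand. The sketch below is base R only and rests on an assumption: that the "jaccard" method compares the sets of single characters in each string (q-grams with q = 1, which I believe is the stringdist default). The helper name jaccard_chars is hypothetical and is not part of stringdist.

```r
# Hypothetical helper (not part of stringdist): Jaccard distance between the
# sets of unique characters of two strings, assuming q-grams of size 1.
jaccard_chars <- function(a, b) {
  sa <- unique(strsplit(tolower(a), "")[[1]])
  sb <- unique(strsplit(tolower(b), "")[[1]])
  1 - length(intersect(sa, sb)) / length(union(sa, sb))
}

d_close <- jaccard_chars("blissard", "blizzard")   # 6 shared of 8 distinct letters: 0.25
d_far   <- jaccard_chars("hurricane", "blizzard")  # 3 shared of 12 distinct letters: 0.75

d_close <= 0.6  # TRUE: would be recoded to "blizzard"
d_far   <= 0.6  # FALSE: would keep its original value
```

Under that assumption, "blissard" falls well inside the tolerance while "hurricane" stays outside it for both targets, which agrees with the output shown above.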
Answer 1 (score: 1)
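f <- function( x )
{
  A <- c( "blizzard", "storm" )    # target categories
  A3 <- sapply(A,substr,1,3)       # first three letters of each target
  x <- as.character(x)
  # index of the last target whose prefix occurs in x (0 if none does)
  n <- max( c( 0, which( sapply( A3, grepl, tolower(x) ) ) ) )
  if ( n==0 )
  {
    warning( "nothing found")
    return (x)
  }
  A[n]
}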
df <- data.frame( id = 1:5,
var1 = c( "blissard", "Blizzard", "storm of snow", "DUST DEVIL/BLIZZARD", "hurricane" ) )
If neither "blizzard" nor "storm" matches, "var1" is left unchanged (with a warning). "hurricane" is an example.
> df$var1_recoded <- sapply(df$var1,f)
Warning message:
In FUN(X[[i]], ...) : nothing found
> df
id var1 var1_recoded
1 1 blissard blizzard
2 2 Blizzard blizzard
3 3 storm of snow storm
4 4 DUST DEVIL/BLIZZARD blizzard
5 5 hurricane hurricane
>
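To generalize this to an arbitrary list of targets (the "1000 replacements" case from the question), the targets can be passed in as an argument rather than hard-coded. The following is a minimal base R sketch, not the answerer's code: the function name recode_by_prefix and its parameters targets and n are hypothetical. It keeps the same rule as above (the first n letters of each target are matched case-insensitively, and later targets override earlier ones), but it works on the whole column at once and leaves unmatched values unchanged without a warning.

```r
# Hypothetical helper: recode every element of x to the target whose first
# n letters occur somewhere in it; unmatched elements are returned unchanged.
recode_by_prefix <- function(x, targets, n = 3) {
  x <- as.character(x)
  out <- x
  prefixes <- substr(tolower(targets), 1, n)
  for (i in seq_along(targets)) {
    # fixed = TRUE treats the prefix as a literal string, not a regex
    hit <- grepl(prefixes[i], tolower(x), fixed = TRUE)
    out[hit] <- targets[i]
  }
  out
}

df <- data.frame(id = 1:5,
                 var1 = c("blissard", "Blizzard", "storm of snow",
                          "DUST DEVIL/BLIZZARD", "hurricane"),
                 stringsAsFactors = FALSE)
df$var1_recoded <- recode_by_prefix(df$var1, c("blizzard", "storm"))
```

Because the loop runs over the targets rather than over the rows, the cost grows with the number of targets, and each pass is a single vectorized grepl call. Note that a short prefix can match inside unrelated words, so n may need tuning for a large target list.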