I have a dataset that contains a variable called Region, which represents different regions within Australia. Here are 25 rows from the data:
> head(sample.2013$Region, n = 25)
[1] QLD major urban - capital city VIC rural NSW regional - low urbanisation
[4] SA regional - low urbanisation NSW regional - low urbanisation Tas rural
[7] ACT major urban - capital city QLD rural ACT major urban - capital city
[10] NT regional - low urbanisation NSW other QLD rural
[13] ACT major urban - capital city VIC regional - high urbanisation Tas rural
[16] QLD major urban - capital city Tas rural VIC regional - high urbanisation
[19] QLD rural Tas rural VIC rural
[22] QLD other urban Tas rural VIC rural
[25] ACT major urban - capital city
36 Levels: ACT major urban - capital city NSW major urban - capital city NSW other urban ... ?
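I need to create another variable, called state, based on the values in this column. At the moment I am just using a brute-force approach to build the new vector: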
add_states <- function(sample.2013) {
  # Add states from the region variable
  sample.2013$State[grepl('NSW', sample.2013$Region) == TRUE] <- 'NSW'
  sample.2013$State[grepl('VIC', sample.2013$Region) == TRUE] <- 'VIC'
  sample.2013$State[grepl('QLD', sample.2013$Region) == TRUE] <- 'QLD'
  sample.2013$State[grepl('WA', sample.2013$Region) == TRUE] <- 'WA'
  sample.2013$State[grepl('SA', sample.2013$Region) == TRUE] <- 'SA'
  sample.2013$State[grepl('Tas', sample.2013$Region) == TRUE] <- 'TAS'
  sample.2013$State[grepl('TAS', sample.2013$Region) == TRUE] <- 'TAS'
  sample.2013$State[grepl('ACT', sample.2013$Region) == TRUE] <- 'ACT'
  sample.2013$State[grepl('NT', sample.2013$Region) == TRUE] <- 'NT'
  return(sample.2013)
}
This works, but it is hard to test and brittle. For example, I now know that I can pass ignore.case to grepl, which would remove the need for the two Tasmania cases.
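To illustrate, here is a minimal sketch on made-up values (the `regions` vector is hypothetical, only shaped like the Region levels). It shows one case-insensitive pattern covering both spellings of Tasmania, and also a pitfall that seems worth checking before relying on the flag: once case is ignored, an unanchored "SA" matches the "sa" inside "urbanisation".

```r
# Hypothetical sample values in the same shape as the Region levels
regions <- c("Tas rural", "TAS other urban", "NSW regional - low urbanisation")

# One case-insensitive pattern covers both "Tas" and "TAS"
tas_hit <- grepl("TAS", regions, ignore.case = TRUE)
tas_hit
# [1]  TRUE  TRUE FALSE

# Caution: unanchored and case-insensitive, "SA" also matches the "sa"
# inside "urbanisation"
sa_loose <- grepl("SA", regions, ignore.case = TRUE)
sa_loose
# [1] FALSE FALSE  TRUE

# Anchoring to the start of the string and adding a word boundary avoids that
sa_anchored <- grepl("^SA\\b", regions, ignore.case = TRUE)
sa_anchored
# [1] FALSE FALSE FALSE
```

Anchoring with `^` assumes the state code is always the first word of the value, which holds for the rows shown above but should be confirmed against all 36 levels.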
I have been able to replace the 'naive' approach above with a for loop and a function like this:
add_state <- function(input, output, state) {
  # Return output with every position whose input matches state
  # (ignoring case) set to that state code; replace() returns a modified copy
  output <- replace(output, grepl(state, input, ignore.case = TRUE), state)
  output
}
state_codes <- c('NSW', 'VIC', 'QLD', 'WA', 'SA', 'TAS', 'ACT', 'NT')
test_vector <- head(sample.2013$Region, n = 500)
y <- vector('character', length = length(test_vector))
for (i in seq_along(state_codes)) {
  y <- add_state(test_vector, y, state_codes[i])
}
table(y)
y
    ACT NSW  NT QLD  SA TAS VIC  WA
 14  99  50  42  49  98  92  45  11
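But this is also quite verbose, and for loops are not idiomatic R. I have not been able to replace this code with an apply function that replaces the values in the vector rather than creating a bunch of other vectors.
This is the best I have managed using lapply: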
add_state3 <- function(x, state) {
  x <- replace(x, grepl(state, x, ignore.case = TRUE), state)
  x
}
test_vector_short <- c("NSW 1", "NSW 2", "Vic", "Goo")
> output <- lapply(state_codes, add_state3, x = test_vector_short)
> output
[[1]]
[1] "NSW" "NSW" "Vic" "Goo"
[[2]]
[1] "NSW 1" "NSW 2" "VIC" "Goo"
[[3]]
[1] "NSW 1" "NSW 2" "Vic" "Goo"
[[4]]
[1] "NSW 1" "NSW 2" "Vic" "Goo"
[[5]]
[1] "NSW 1" "NSW 2" "Vic" "Goo"
[[6]]
[1] "NSW 1" "NSW 2" "Vic" "Goo"
[[7]]
[1] "NSW 1" "NSW 2" "Vic" "Goo"
[[8]]
[1] "NSW 1" "NSW 2" "Vic" "Goo"
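The function works: it takes each state code and passes it to add_state3, but it creates a list of 8 elements instead of replacing the elements of a single vector.
Sorry for the long preamble, but basically my question is: how do I use an apply function to change the elements of a vector based on some criteria?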
Answer 0 (score: 3)
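You can use gsub to combine the search and the replacement. For example, gsub('^.*\\bNT\\b.*$', 'NT', x) replaces every string in x that contains NT with just NT (the \\b word boundaries stop something like "pint" from matching "NT" in a case-insensitive search).
If you make your regex something like '^.*\\b(NSW|NT|QLD|...)\\b.*' and then replace with \\1 (the captured match), you can do this: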
state.regex <- sprintf('^.*\\b(%s)\\b.*$', paste(state_codes, collapse='|'))
# "^.*\\b(NSW|VIC|QLD|WA|SA|TAS|ACT|NT)\\b.*$"
gsub(state.regex, '\\1', test_vector_short, ignore.case=T)
# [1] "NSW" "NSW" "Vic" "Goo"
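This relies only on the fact that whenever you find a match you want to replace the whole string with the matched text, and that the patterns (the state codes) can be collapsed into a single regex.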
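Note that \\1 returns the text as it was written ("Vic") and that strings with no match ("Goo") pass through unchanged. A minimal sketch of one way to tidy that up, assuming you want upper-case codes and NA for anything unrecognised (that choice is an assumption, not something the question specifies):

```r
state_codes <- c("NSW", "VIC", "QLD", "WA", "SA", "TAS", "ACT", "NT")
state.regex <- sprintf("^.*\\b(%s)\\b.*$", paste(state_codes, collapse = "|"))

x <- c("NSW 1", "NSW 2", "Vic", "Goo")  # same values as test_vector_short

# \\1 keeps the matched text as written, so normalise the case afterwards
state <- toupper(sub(state.regex, "\\1", x, ignore.case = TRUE))

# Anything that is still not a known code did not match; mark it as missing
state[!state %in% state_codes] <- NA
state
# [1] "NSW" "NSW" "VIC" NA
```

sub is enough here because the pattern is anchored at both ends, so there is at most one match per string.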
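Otherwise, I believe you have to loop as you did (because each replacement has to be applied to the vector already updated by the previous one).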
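If you would still rather avoid writing the loop by hand, that "carry the updated vector forward" pattern is what base R's Reduce does. This is a sketch of a swapped-in alternative, not part of the approach above; it reuses the question's add_state helper and the values of test_vector_short under the hypothetical name `input`:

```r
state_codes <- c("NSW", "VIC", "QLD", "WA", "SA", "TAS", "ACT", "NT")
input <- c("NSW 1", "NSW 2", "Vic", "Goo")  # same values as test_vector_short

# Same helper as in the question: set entries whose input matches `state`
add_state <- function(input, output, state) {
  replace(output, grepl(state, input, ignore.case = TRUE), state)
}

# Reduce feeds the result of each step into the next one, starting from `init`
y <- Reduce(
  function(output, state) add_state(input, output, state),
  state_codes,
  init = character(length(input))
)
y
# [1] "NSW" "NSW" "VIC" ""
```

Unlike lapply, which calls the function independently for each state code and so returns 8 separate vectors, Reduce returns a single vector. Unmatched entries keep the empty string they started with.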
Answer 1 (score: 3)
There seems to be a pattern of STATECODE other stuff, so you could strsplit and take the first element.
Using test:
test <- c(
"QLD major urban - capital city",
"Vic rural",
"NSW regional - low urbanisation",
"SA regional - low urbanisation",
"NSW regional - low urbanisation",
"guff and goo"
)
result <- toupper(sapply(strsplit(test," "),`[`,1))
replace(result, !result %in% state_codes, NA)
#[1] "QLD" "VIC" "NSW" "SA" "NSW" NA
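One thing to watch when applying this to the real column: the printed "36 Levels" line suggests sample.2013$Region is a factor, and strsplit only accepts character vectors. A minimal sketch of the same idea with the conversion added, using a hypothetical stand-in factor `region`:

```r
state_codes <- c("NSW", "VIC", "QLD", "WA", "SA", "TAS", "ACT", "NT")

# Hypothetical factor standing in for sample.2013$Region
region <- factor(c("QLD rural", "Tas rural", "guff and goo"))

# strsplit() needs a character vector, so convert the factor first;
# vapply() also guarantees a character result of the same length
first_word <- vapply(strsplit(as.character(region), " ", fixed = TRUE),
                     `[`, character(1), 1)
state <- toupper(first_word)
state[!state %in% state_codes] <- NA
state
# [1] "QLD" "TAS" NA
```

vapply is used instead of sapply only so that the result type is checked; the logic is otherwise the same as above.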
Answer 2 (score: 2)
Since the first word of each Region is the state code, you can strip the rest and use the result as the new state variable:
sample.2013 <- data.frame(Region=c('QLD major urban - capital city','VIC rural','NSW regional - low urbanisation','SA regional - low urbanisation','NSW regional - low urbanisation Tas rural','ACT major urban - capital city','QLD rural','ACT major urban - capital city','NT regional - low urbanisation','NSW other','QLD rural','ACT major urban - capital city','VIC regional - high urbanisation Tas rural','QLD major urban - capital city','Tas rural','VIC regional - high urbanisation','QLD rural','Tas rural','VIC rural','QLD other urban','Tas rural','VIC rural','ACT major urban - capital city'));
sample.2013$state <- toupper(sub(' .*','',sample.2013$Region));
sample.2013;
## Region state
## 1 QLD major urban - capital city QLD
## 2 VIC rural VIC
## 3 NSW regional - low urbanisation NSW
## 4 SA regional - low urbanisation SA
## 5 NSW regional - low urbanisation Tas rural NSW
## 6 ACT major urban - capital city ACT
## 7 QLD rural QLD
## 8 ACT major urban - capital city ACT
## 9 NT regional - low urbanisation NT
## 10 NSW other NSW
## 11 QLD rural QLD
## 12 ACT major urban - capital city ACT
## 13 VIC regional - high urbanisation Tas rural VIC
## 14 QLD major urban - capital city QLD
## 15 Tas rural TAS
## 16 VIC regional - high urbanisation VIC
## 17 QLD rural QLD
## 18 Tas rural TAS
## 19 VIC rural VIC
## 20 QLD other urban QLD
## 21 Tas rural TAS
## 22 VIC rural VIC
## 23 ACT major urban - capital city ACT