I have a character vector like this:
strings <- c("UUDBK", "KUVEB", "YVCYE")
I also have a data frame like this:
replacewith <- c(8, 4, 2)
searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN")
dataframe <- data.frame(replacewith, searchhere)
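I want to replace each string in the vector with the value from the corresponding "replacewith" column of this data frame. This is how I currently do it: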
final <- sapply(as.character(strings), function(x)
as.numeric(dataframe[grep(x, dataframe$searchhere), 1]))
However, this is very slow with a character vector of length 10^9.
Is there a better way to do this?
Thanks!
Answer 0 (score: 2)
library(tidyr)
strings <- c("UUDBK", "KUVEB", "YVCYE")
replacewith <- c(8, 4, 2)
searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN")
dataframe <- data.frame(replacewith, searchhere, stringsAsFactors = FALSE)
# split strings to one row each
# like a look up table
d = separate_rows(dataframe, searchhere)
# get the number based on the look up table
d[d$searchhere %in% strings,]
#   replacewith searchhere
# 1           8      UUDBK
# 2           8      YVCYE
# 6           4      KUVEB
Not sure whether you like this format, but you can always reshape it.
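One possible way to reshape the lookup into a result aligned with strings, sketched in base R only. The names keys_list, lookup, key and value are illustrative and not part of the original answer, and strsplit() stands in for separate_rows() here.

```r
# Sample data from the question.
strings <- c("UUDBK", "KUVEB", "YVCYE")
replacewith <- c(8, 4, 2)
searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN")

# Split each comma-separated cell into its individual keys.
keys_list <- strsplit(searchhere, ",\\s*")

# One row per key: repeat each replacement value once per key in its cell.
lookup <- data.frame(
  key = unlist(keys_list),
  value = rep(replacewith, lengths(keys_list)),
  stringsAsFactors = FALSE
)

# match() gives, for every element of `strings`, the position of its first
# occurrence in lookup$key (NA if absent), so the result lines up with `strings`.
final <- lookup$value[match(strings, lookup$key)]
names(final) <- strings
final
```

Because match() is vectorised, the whole lookup is a single call rather than one grep() per string. Strings that are missing from the table come back as NA.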
Answer 1 (score: 2)
Similar to @AntoniosK's idea, this uses a hashmap to map the strings to their values. hashmap is implemented internally with Rcpp, so it is very fast:
library(hashmap)
library(tidyr)
search_replace = separate_rows(dataframe, searchhere)
search_hash = hashmap(search_replace[,2], search_replace[,1])
search_hash[[strings]]
Result:
> search_hash
## (character) => (numeric)
## [KHUDN] => [+2.000000]
## [KUEBN] => [+2.000000]
## [UGEVB] => [+4.000000]
## [KUVEB] => [+4.000000]
## [IYVEK] => [+8.000000]
## [IHVYV] => [+8.000000]
## [...] => [...]
> search_hash[[strings]]
[1] 8 4 8
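If the hashmap package is not available, the same key-to-value idea can be sketched with a hashed environment from base R. This is an assumption-laden alternative, not the method used in this answer; keys, values and lookup_env are illustrative names, and it has not been benchmarked here.

```r
# Sample data from the question.
strings <- c("UUDBK", "KUVEB", "YVCYE")
replacewith <- c(8, 4, 2)
searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN")

# Flatten the comma-separated cells into parallel key and value vectors.
keys_list <- strsplit(searchhere, ",\\s*")
keys <- unlist(keys_list)
values <- rep(replacewith, lengths(keys_list))

# Build a hashed environment holding one binding per key.
lookup_env <- list2env(setNames(as.list(values), keys), hash = TRUE)

# mget() fetches bindings by name, in the order the names are given;
# ifnotfound supplies NA for keys that have no binding.
result <- unlist(
  mget(strings, envir = lookup_env, ifnotfound = NA_real_),
  use.names = FALSE
)
result
```

Without ifnotfound, mget() raises an error on the first unknown key, so the argument is worth keeping when the input may contain strings that are not in the table.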
Benchmarks:
> OP_func = function(){sapply(as.character(strings), function(x)
as.numeric(dataframe[grep(x,dataframe$searchhere), 1]))}
Unit: microseconds
                           expr     min       lq      mean   median      uq      max neval
                      OP_func() 121.191 124.9410 190.36472 129.8760 151.193 3370.047   100
 d[d$searchhere %in% strings, ]  36.714  40.6605  52.85093  43.8185  61.583  147.246   100
         search_hash[[strings]]  14.212  18.1590  25.05212  21.5150  29.608   58.820   100
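Also note that @AntoniosK's solution does not work if strings contains duplicates, because subsetting with %in% returns each matching lookup row only once, in lookup-table order. hashmap, by contrast, returns the correct mapping for each element, in the correct position.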
Example:
> strings_large = sample(search_replace$searchhere, 100, replace = TRUE)
> strings_large
[1] "YVCYE" "KUVEB" "KUYVE" "KHUDN" "KUYVE" "KHUDN" "KUEBN" "UUDBK" "KHUDN" "YVCYE" "IYVEK"
[12] "KUEBN" "KHUDN" "IHBEJ" "YVCYE" "KHUDN" "KUEBN" "UGEVB" "UUDBK" "KUYVE" "KHUDN" "IHBEJ"
[23] "IHVYV" "KUVEB" "IYVEK" "KHUDN" "KHUDN" "KUYVE" "YVCYE" "UUDBK" "KUYVE" "IHVYV" "KUYVE"
[34] "KUEBN" "KUYVE" "UUDBK" "KUYVE" "KUVEB" "KUVEB" "YVCYE" "KUYVE" "KHUDN" "KUVEB" "YVCYE"
[45] "IHBEJ" "YVCYE" "KHUDN" "UUDBK" "KUEBN" "IYVEK" "IHVYV" "UUDBK" "KUYVE" "KUEBN" "YVCYE"
[56] "UGEVB" "YVCYE" "KUYVE" "IHVYV" "KUEBN" "IHVYV" "IHBEJ" "KUVEB" "IHVYV" "KUYVE" "KUEBN"
[67] "IYVEK" "KUVEB" "KUEBN" "UGEVB" "KUEBN" "KUVEB" "IHBEJ" "KUYVE" "YVCYE" "YVCYE" "IHVYV"
[78] "YVCYE" "KHUDN" "KHUDN" "YVCYE" "IYVEK" "KUYVE" "KHUDN" "UGEVB" "YVCYE" "IHVYV" "KUVEB"
[89] "IYVEK" "KUEBN" "UGEVB" "UUDBK" "IYVEK" "IHBEJ" "IHBEJ" "UUDBK" "KUVEB" "UGEVB" "IYVEK"
[100] "IYVEK"
> search_hash[[strings_large]]
[1] 8 4 8 2 8 2 2 8 2 8 8 2 2 2 8 2 2 4 8 8 2 2 8 4 8 2 2 8 8 8 8 8 8 2 8 8 8 4 4 8 8 2 4 8
[45] 2 8 2 8 2 8 8 8 8 2 8 4 8 8 8 2 8 2 4 8 8 2 8 4 2 4 2 4 2 8 8 8 8 8 2 2 8 8 8 2 4 8 8 4
[89] 8 2 4 8 8 2 2 8 4 4 8 8
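The difference between filtering and positional lookup can be seen on a tiny input. This is a minimal base R sketch; lookup and strings_dup are made-up names holding a three-row excerpt of the table, and match() is used in place of hashmap to show the positional behaviour.

```r
# A three-row excerpt of the lookup table (illustrative).
lookup <- data.frame(
  key = c("UUDBK", "YVCYE", "KUVEB"),
  value = c(8, 8, 4),
  stringsAsFactors = FALSE
)

# Input with a repeated string, in a different order than the table.
strings_dup <- c("KUVEB", "UUDBK", "KUVEB")

# Filtering with %in% keeps each matching table row once, in table order.
filtered <- lookup$value[lookup$key %in% strings_dup]
filtered    # 8 4

# Positional lookup returns one value per input element, in input order.
positional <- lookup$value[match(strings_dup, lookup$key)]
positional  # 4 8 4
```

The filtered result has two elements for three inputs, so it cannot be used as a drop-in replacement vector, while the positional result has the same length and order as the input.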