Question

I have a character vector of strings in the following format:

strings <- c("UUDBK", "KUVEB", "YVCYE")

I also have a data frame like this:

replacewith <- c(8, 4, 2)
searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN")
dataframe <- data.frame(replacewith, searchhere)

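I want to replace each string in the vector with the value from the corresponding "replacewith" column of this data frame. At the moment I am doing it like this: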

final <- sapply(as.character(strings), function(x)
as.numeric(dataframe[grep(x, dataframe$searchhere), 1]))

However, doing this with a character vector of length 10^9 is very slow.

Is there a better way to do this?

Thanks!

Answer 1

library(tidyr)

strings <- c("UUDBK", "KUVEB", "YVCYE")

replacewith <- c(8, 4, 2)
searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN")
dataframe <- data.frame(replacewith, searchhere, stringsAsFactors = FALSE)

# split strings to one row each
# like a look up table
d = separate_rows(dataframe, searchhere)

# get the number based on the look up table
d[d$searchhere %in% strings,]

#   replacewith searchhere
# 1           8      UUDBK
# 2           8      YVCYE
# 6           4      KUVEB

Not sure whether you like this format, but you can always reshape it.
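If a plain vector in the same order as strings is wanted, one possible reshaping is a base R lookup with match(). This is a minimal sketch rather than part of the answer above: the names parts, keys, values and final_vec are chosen here for illustration, and it assumes the entries in searchhere are always separated by exactly ", ".

```r
strings <- c("UUDBK", "KUVEB", "YVCYE")

replacewith <- c(8, 4, 2)
searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN")
dataframe <- data.frame(replacewith, searchhere, stringsAsFactors = FALSE)

# Split each searchhere entry into its individual strings
# (assumption: the separator is always ", ")
parts <- strsplit(dataframe$searchhere, ", ", fixed = TRUE)

# Flatten into two parallel vectors: one value per individual string
keys <- unlist(parts)
values <- rep(dataframe$replacewith, lengths(parts))

# match() gives, for each element of strings, the position of its
# first occurrence in keys; indexing values by that keeps the input order
final_vec <- values[match(strings, keys)]
print(final_vec)
```

Strings that do not occur anywhere in searchhere come back as NA, so they can be detected with is.na() afterwards.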

Answer 2

Similar to @AntoniosK's idea, this uses hashmap to map the strings to their values. hashmap is implemented internally with Rcpp, so it is very fast:

library(hashmap)
library(tidyr)

search_replace = separate_rows(dataframe, searchhere)

search_hash = hashmap(search_replace[,2], search_replace[,1])

search_hash[[strings]]

Result:

> search_hash
## (character) => (numeric)  
##     [KHUDN] => [+2.000000]
##     [KUEBN] => [+2.000000]
##     [UGEVB] => [+4.000000]
##     [KUVEB] => [+4.000000]
##     [IYVEK] => [+8.000000]
##     [IHVYV] => [+8.000000]
##       [...] => [...] 

> search_hash[[strings]]
[1] 8 4 8
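If the hashmap package cannot be installed, a hashed environment from base R gives a comparable key to value lookup. The following is only a sketch under that assumption, not the code used in this answer, and it is not covered by the timings reported here; keys, values, env and res are illustrative names, and the key/value pairs are written out by hand instead of being derived from search_replace.

```r
strings <- c("UUDBK", "KUVEB", "YVCYE")

# Flattened lookup table: one key per individual string, with its value
keys <- c("UUDBK", "YVCYE", "KUYVE", "IHVYV", "IYVEK",
          "KUVEB", "UGEVB", "KUEBN", "IHBEJ", "KHUDN")
values <- c(8, 8, 8, 8, 8, 4, 4, 2, 2, 2)

# Build a hashed environment that stores each value under its key
env <- list2env(as.list(setNames(values, keys)), hash = TRUE)

# mget() looks up every element of strings, in order, and returns a list;
# unlist() turns that list back into a numeric vector
res <- unlist(mget(strings, envir = env), use.names = FALSE)
print(res)
```

Note that mget() raises an error for a key that is not in the environment unless its ifnotfound argument is supplied.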

Benchmark:

> OP_func = function(){sapply(as.character(strings), function(x)
    as.numeric(dataframe[grep(x,dataframe$searchhere), 1]))}

Unit: microseconds
                           expr     min       lq      mean   median      uq      max neval
                      OP_func() 121.191 124.9410 190.36472 129.8760 151.193 3370.047   100
 d[d$searchhere %in% strings, ]  36.714  40.6605  52.85093  43.8185  61.583  147.246   100
         search_hash[[strings]]  14.212  18.1590  25.05212  21.5150  29.608   58.820   100

Also note that @AntoniosK's solution does not work if strings contains duplicates, whereas hashmap returns the correct mapping for each element in its correct position.

Example:

> strings_large = sample(search_replace$searchhere, 100, replace = TRUE)
> strings_large
  [1] "YVCYE" "KUVEB" "KUYVE" "KHUDN" "KUYVE" "KHUDN" "KUEBN" "UUDBK" "KHUDN" "YVCYE" "IYVEK"
 [12] "KUEBN" "KHUDN" "IHBEJ" "YVCYE" "KHUDN" "KUEBN" "UGEVB" "UUDBK" "KUYVE" "KHUDN" "IHBEJ"
 [23] "IHVYV" "KUVEB" "IYVEK" "KHUDN" "KHUDN" "KUYVE" "YVCYE" "UUDBK" "KUYVE" "IHVYV" "KUYVE"
 [34] "KUEBN" "KUYVE" "UUDBK" "KUYVE" "KUVEB" "KUVEB" "YVCYE" "KUYVE" "KHUDN" "KUVEB" "YVCYE"
 [45] "IHBEJ" "YVCYE" "KHUDN" "UUDBK" "KUEBN" "IYVEK" "IHVYV" "UUDBK" "KUYVE" "KUEBN" "YVCYE"
 [56] "UGEVB" "YVCYE" "KUYVE" "IHVYV" "KUEBN" "IHVYV" "IHBEJ" "KUVEB" "IHVYV" "KUYVE" "KUEBN"
 [67] "IYVEK" "KUVEB" "KUEBN" "UGEVB" "KUEBN" "KUVEB" "IHBEJ" "KUYVE" "YVCYE" "YVCYE" "IHVYV"
 [78] "YVCYE" "KHUDN" "KHUDN" "YVCYE" "IYVEK" "KUYVE" "KHUDN" "UGEVB" "YVCYE" "IHVYV" "KUVEB"
 [89] "IYVEK" "KUEBN" "UGEVB" "UUDBK" "IYVEK" "IHBEJ" "IHBEJ" "UUDBK" "KUVEB" "UGEVB" "IYVEK"
[100] "IYVEK"

> search_hash[[strings_large]]
  [1] 8 4 8 2 8 2 2 8 2 8 8 2 2 2 8 2 2 4 8 8 2 2 8 4 8 2 2 8 8 8 8 8 8 2 8 8 8 4 4 8 8 2 4 8
 [45] 2 8 2 8 2 8 8 8 8 2 8 4 8 8 8 2 8 2 4 8 8 2 8 4 2 4 2 4 2 8 8 8 8 8 2 2 8 8 8 2 4 8 8 4
 [89] 8 2 4 8 8 2 2 8 4 4 8 8
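To make the point about duplicates concrete, the small base R sketch below contrasts the %in% filter with a per-element lookup. It uses a reduced three-row table and a named vector instead of hashmap, so it only illustrates the behaviour; strings_dup, filtered, lookup and mapped are names invented for this example.

```r
# Reduced lookup table (one row per individual string)
d <- data.frame(replacewith = c(8, 8, 4),
                searchhere = c("UUDBK", "YVCYE", "KUVEB"),
                stringsAsFactors = FALSE)

# Input with repeated elements
strings_dup <- c("KUVEB", "UUDBK", "KUVEB", "KUVEB")

# %in% only tests table rows for membership, so each matching row
# appears once, in table order, regardless of repeats in the input
filtered <- d[d$searchhere %in% strings_dup, ]
print(nrow(filtered))

# Indexing a named vector by the input returns one value per input
# element, in the order of the input
lookup <- setNames(d$replacewith, d$searchhere)
mapped <- unname(lookup[strings_dup])
print(mapped)
```

Here the filter yields two rows for four input strings, while the per-element lookup yields four values aligned with the input.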

How can I replace strings with numbers based on their position in a data frame?
