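Efficient string search and replace

Date: 2015-11-18 05:06:05

Tags: regex r performance data-cleansing

I'm trying to clean up about 2 million entries in a database of job titles. Many contain abbreviations that I'd like to change to a single consistent form that is easier to search. So far I've just been running a separate mapply(gsub, ...) command over the column for each change. But I have about 80 changes to make, so a full run takes almost 30 minutes. There has to be a better way. I'm new to string searching; I found the *$ trick, which helped. Is there a way to do more than one search in a single mapply? I imagine that might be faster? Any help would be great. Thanks.

Here is some of the code. test is a column containing 2 million individuals' job titles.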

test <- mapply(gsub, " Admin ", " Administrator ", test)
test <- mapply(gsub, "Admin ", "Administrator ", test)
test <- mapply(gsub, " Admin*$", " Administrator", test)
test <- mapply(gsub, "Acc ", " Accounting ", test)
test <- mapply(gsub, " Admstr ", " Administrator ", test)
test <- mapply(gsub, " Anlyst ", " Analyst ", test)
test <- mapply(gsub, "Anlyst ", "Analyst ", test)
test <- mapply(gsub, " Asst ", " Assistant ", test)
test <- mapply(gsub, "Asst ", "Assistant ", test)
test <- mapply(gsub, " Assoc ", " Associate ", test)
test <- mapply(gsub, "Assoc ", "Associate ", test)

2 answers:

Answer 0 (score: 5)

One option is to use mgsub from the qdap package:

library(qdap)
mgsub(patternVec, replaceVec, test)

Data

patternVec <- c(" Admin ", "Admin ")
replaceVec <- c(" Administrator ",  "Administrator ")
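
For readers without qdap installed, the same idea of substituting several patterns in order can be sketched in base R. This is only an illustration, not a description of how qdap implements mgsub: the helper name replace_all_fixed and the sample titles are made up here, and fixed = TRUE is an assumption that treats each pattern as a literal string. Because gsub() is already vectorised over its input, each pattern needs only one call for the whole column rather than one call per element.

```r
# Hypothetical helper (not part of qdap): apply literal replacements in order.
replace_all_fixed <- function(x, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  for (i in seq_along(patterns)) {
    # gsub() is vectorised over x, so one call processes every element.
    # fixed = TRUE is an assumption: patterns are literal text, not regexes.
    x <- gsub(patterns[i], replacements[i], x, fixed = TRUE)
  }
  x
}

patternVec <- c(" Admin ", "Admin ")
replaceVec <- c(" Administrator ", "Administrator ")

# Made-up sample data standing in for the real column.
titles <- c("Office Admin Lead", "Admin Assistant")
result <- replace_all_fixed(titles, patternVec, replaceVec)
print(result)
```

The order of the patterns matters, since each replacement sees the output of the previous one.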

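Answer 1 (score: 3)

Here is a working base R solution. You can define a data frame that holds all of the patterns and their replacements. Then you use apply() in row mode and call gsub() on the test vector for each pattern/replacement combination. Here is some sample code: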

df <- data.frame(pattern=c(" Admin ", "Admin "),
                 replacement=c(" Administrator ", "Administrator "))

test <- c(" Admin ", "Admin ")

apply(df, 1, function(x) {
                test <<- gsub(x[1], x[2], test)
             })

> test
[1] " Administrator " "Administrator "
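
A possible variation on the same idea, sketched under a few assumptions: Reduce() threads the vector through each pattern/replacement row, which avoids the global assignment with <<-, and the word-boundary anchor \\b lets one pattern cover an abbreviation at the start, middle, or end of a title. The sample titles and the choice of \\b are illustrative and not from the original post; note that \\b also treats punctuation as a boundary, so "Admin." would be rewritten too.

```r
# Patterns use \\b (word boundary) so one row replaces several spaced variants.
# stringsAsFactors = FALSE keeps the columns as character in older R versions.
df <- data.frame(pattern     = c("\\bAdmin\\b", "\\bAsst\\b"),
                 replacement = c("Administrator", "Assistant"),
                 stringsAsFactors = FALSE)

# Made-up sample data standing in for the real column.
titles <- c("Admin Asst", "Senior Admin", "Administrator")

# Reduce() feeds the result of each gsub() call into the next one,
# starting from the original vector passed as the initial value.
cleaned <- Reduce(
  function(x, i) gsub(df$pattern[i], df$replacement[i], x, perl = TRUE),
  seq_len(nrow(df)),
  titles
)
print(cleaned)
```

The already expanded "Administrator" is left alone because there is no word boundary after "Admin" inside it. As a side note, in a pattern such as " Admin*$" the * applies only to the preceding n, so it matches " Admi" followed by any number of n characters at the end of the string; " Admin$" may be closer to what was intended.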