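Replace specific characters in a variable of a data frame in R

Asked: 2014-10-20 16:21:14

Tags: r string replace dataframe

I want to replace every , - ) ( and (space) with . in the variable DMA.NAME of the example data frame. I looked at three posts and tried their approaches, but all of them failed: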

Replacing column values in data frame, not included in list

R replace all particular values in a data frame

Replace characters from a column of a data frame R

Approach 1

> shouldbecomeperiod <- c$DMA.NAME %in% c("-", ",", " ", "(", ")")
c$DMA.NAME[shouldbecomeperiod] <- "."

Approach 2

> removetext <- c("-", ",", " ", "(", ")")
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME)
c$DMA.NAME <- gsub(removetext, ".", c$DMA.NAME, fixed = TRUE)

Warning message:
In gsub(removetext, ".", c$DMA.NAME) :
  argument 'pattern' has length > 1 and only the first element will be used

Approach 3

> c[c == c(" ", ",", "(", ")", "-")] <- "."

Example data frame

> df
      DMA.CODE                  DATE                   DMA.NAME count
111         22 8/14/2014 12:00:00 AM               Columbus, OH     1
112         23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn     1
79          18 7/30/2014 12:00:00 AM        Boston (Manchester)     1
99          22 8/20/2014 12:00:00 AM               Columbus, OH     1
112.1       23 7/15/2014 12:00:00 AM Orlando-Daytona Bch-Melbrn     1
208         27 7/31/2014 12:00:00 AM       Minneapolis-St. Paul     1

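I know what the problems are: gsub takes a pattern, and only the first element of the pattern vector is used. The other two approaches look for whole values in the variable that are exactly equal to one of the characters, rather than searching for those characters inside each value.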
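To make that diagnosis concrete, here is a minimal base R sketch. The vector `x` is a hypothetical stand-in for `DMA.NAME` (it is not the asker's data), and `targets`, `whole_match` and `first_only` are illustrative names.

```r
# Hypothetical stand-in for the DMA.NAME column
x <- c("Columbus, OH", "Boston (Manchester)", "-")
targets <- c("-", ",", " ", "(", ")")

# %in% compares each whole element against the set, so only an
# element that is exactly "-" (or ",", " ", ...) counts as a match
whole_match <- x %in% targets
print(whole_match)   # FALSE FALSE TRUE

# Per the warning above, gsub() uses only the first pattern element,
# so the call effectively replaces "-" and nothing else
first_only <- gsub(targets[1], ".", x, fixed = TRUE)
print(first_only)    # "Columbus, OH" "Boston (Manchester)" "."
```

Neither call looks for the characters inside the longer strings, which is why the names come back unchanged.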

2 answers:

Answer 0 (score: 4)

You can use the special groups [:punct:] and [:space:] inside a pattern group ([...]), like this:

df <- data.frame(
  DMA.NAME = c(
    "Columbus, OH",
    "Orlando-Daytona Bch-Melbrn",
    "Boston (Manchester)",
    "Columbus, OH",
    "Orlando-Daytona Bch-Melbrn",
    "Minneapolis-St. Paul"),
  stringsAsFactors=F)
##
> gsub("[[:punct:][:space:]]+","\\.",df$DMA.NAME)
[1] "Columbus.OH"                "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester."         "Columbus.OH"               
[5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"
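If only the five characters named in the question should be replaced, and other punctuation such as an existing `.` left alone, one possible variant is to list them explicitly in a bracket expression. This is a sketch rather than part of the original answer: the hyphen is placed first so that it is read literally, and `names_out` is an illustrative name.

```r
df <- data.frame(
  DMA.NAME = c("Columbus, OH",
               "Orlando-Daytona Bch-Melbrn",
               "Boston (Manchester)",
               "Minneapolis-St. Paul"),
  stringsAsFactors = FALSE)

# "[-, ()]+" matches one or more of: hyphen, comma, space, "(", ")"
names_out <- gsub("[-, ()]+", ".", df$DMA.NAME)
print(names_out)

# Assign the result back to change the column itself
df$DMA.NAME <- names_out
```

With this pattern the period already present in "St." is kept, so that entry becomes "Minneapolis.St..Paul" rather than "Minneapolis.St.Paul".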

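Answer 1 (score: 3)

If your data frame is large, you may want to look at this fast function from the stringi package. It replaces every character of a given class with another string. Here the character class is L, meaning letters (inside the {}), but the capital P (before the {}) says we want the complement of that set, that is, every non-letter character. merge means that consecutive matches are merged into a single match.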

require(stringi)
stri_replace_all_charclass(df$DMA.NAME, "\\P{L}",".", merge=T)
## [1] "Columbus.OH"                "Orlando.Daytona.Bch.Melbrn" "Boston.Manchester."         "Columbus.OH"               
## [5] "Orlando.Daytona.Bch.Melbrn" "Minneapolis.St.Paul"   
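One difference worth keeping in mind: a non-letter class also covers digits, while `[[:punct:][:space:]]` does not. The sketch below illustrates this in base R only, using `[^[:alpha:]]` as a rough analogue of `\P{L}` (an assumption made for illustration, not the stringi call itself) and a made-up input `y` that does not occur in the sample data.

```r
# Made-up input containing digits (not from the sample data)
y <- "Area 51 (Test)"

# Punctuation and whitespace only: the digits survive
punct_space <- gsub("[[:punct:][:space:]]+", ".", y)
print(punct_space)   # "Area.51.Test."

# Every run of non-letters, digits included
non_letter <- gsub("[^[:alpha:]]+", ".", y)
print(non_letter)    # "Area.Test."
```

For the sample names, which contain no digits, both patterns give the same result.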

And some benchmarks:

x <- sample(df$DMA.NAME, 1000, T)
gsubFun <- function(x){
    gsub("[[:punct:][:space:]]+","\\.",x)   
}

striFun <- function(x){
    stri_replace_all_charclass(x, "\\P{L}",".", T)  
}


require(microbenchmark)
microbenchmark(gsubFun(x), striFun(x))
Unit: microseconds
       expr      min        lq   median        uq       max neval
 gsubFun(x) 3472.276 3511.0015 3538.097 3573.5835 11039.984   100
 striFun(x)  877.259  893.3945  907.769  929.8065  3189.017   100