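gsub: replace and preserve case

Time: 2015-08-31 05:48:12

Tags: regex r

I have been using gsub to shorten words inside longer strings. I want to abbreviate a word while inheriting as much of the input's capitalization as possible.

Example, changing hello to hi: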

x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")

but respecting the original case:

c("Hi World", "HI WORLD", "hi world", "hI world")

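Most of the cases I really need to match are "HI", "hi" and "Hi". I don't care much about "hI", but I include it as a possibility for completeness.

To get this done, I have a tedious approach that maintains vectors of target and replacement strings: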

xin <- c("Hello ", "HELLO ", "hello ", "hElLo ")
xout <- c("Hi ", "HI ", "hi ", "hI ")
mapply(gsub, xin, xout, x)

This gives the right answer, see:

     Hello      HELLO      hello      hElLo
"Hi World" "HI WORLD" "hi world" "hI world"

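But this is awkward, time-consuming and inflexible! So far I have a family of 50 words that we want to abbreviate, and maintaining every case combination is tiresome.

The data is a mixed-case mess because humans typed in roughly 78,000 records and capitalized words such as Department and University in every possible way. The long phrases they entered do not fit the space allowed on the printed page, and we have been asked to shorten them to "dept" and "univ". If possible, we would like to preserve the capitalization.

My only idea does not look like R to me: split the original input and tabulate the existing case of the first 2 letters.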
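If only the three common patterns matter (all lower, all upper, capitalized), the variant vectors do not have to be typed by hand. What follows is a minimal sketch, assuming two hypothetical helpers, `case_variants` and `abbr_variants`, that are not part of the original post. Mixed-case input such as "hElLo" is deliberately left untouched.

```r
# Hypothetical helper: build the lower, UPPER and Title variants of a word.
case_variants <- function(w) {
  lower <- tolower(w)
  title <- paste0(toupper(substring(lower, 1, 1)), substring(lower, 2))
  c(lower, toupper(w), title)
}

# Hypothetical helper: replace each variant of `old` with the
# same-cased variant of `new`, using literal (fixed) matching.
abbr_variants <- function(x, old, new) {
  from <- case_variants(old)
  to <- case_variants(new)
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")
res <- abbr_variants(x, "hello", "hi")
```

With 50 words this could be run in a loop over the word list, at the cost of ignoring any capitalization outside those three patterns.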

xcap <- sapply(strsplit(x, split = ""), function(x) x %in% LETTERS)[1:2, ]
> t(xcap)
      [,1]  [,2]
[1,]  TRUE FALSE
[2,]  TRUE  TRUE
[3,] FALSE FALSE
[4,] FALSE  TRUE

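I'm fairly sure I can use that case information to make it work, but I haven't succeeded yet. I just realized that G. Grothendieck's gsubfn package might be useful, but the terminology there ("proto" objects) is new to me.

I may keep going in this direction, but I'm asking now whether there is a more direct route.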
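As a rough sketch of how that case table could be applied: the names `first2`, `repl` and `hi_cased` below are illustrative only, and the code handles just the two-letter replacement "hi" at the start of each string.

```r
x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")

# Case flags for the first two characters of each string (TRUE = upper case).
first2 <- strsplit(substr(x, 1, 2), split = "")
xcap <- t(sapply(first2, function(ch) ch %in% LETTERS))

# Apply each row of flags to the letters of the replacement "hi".
repl <- c("h", "i")
hi_cased <- apply(xcap, 1, function(flags) {
  paste(ifelse(flags, toupper(repl), repl), collapse = "")
})

# Substitute the cased replacement into the matching string.
res <- mapply(function(s, r) sub("hello", r, s, ignore.case = TRUE),
              x, hi_cased, USE.NAMES = FALSE)
```

Here `res` pairs each string with its own cased replacement, so no hand-maintained vector of variants is needed.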

PJ

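2 Answers:

Answer 0: (score: 2)

Your idea inspired me to write this code. Everything is done inside a single sapply call. The toupper function is used to upper-case the split characters of the xout string.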

x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")

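sapply(x, function(x, xout) {
  # Case flags (TRUE = upper case) for the letters of the first word of x
  xcap <- unlist(strsplit(unlist(strsplit(x, " "))[1], "")) %in% LETTERS
  n <- nchar(xout)
  if (length(xcap) >= n) {
    xcap <- xcap[1:n]
  } else {
    # Replacement is longer than the first word: repeat the last flag
    xcap <- c(xcap, rep(tail(xcap, 1), n - length(xcap)))
  }
  # Upper-case the letters of xout wherever the flag is TRUE
  xout <- paste(sapply(1:n, function(i) {
    if (xcap[i]) toupper(unlist(strsplit(xout, ""))[i])
    else unlist(strsplit(xout, ""))[i]
  }), sep = "", collapse = "")
  xin <- "hello"
  gsub(xin, xout, x[1], ignore.case = TRUE)
}, xout = "selamlar")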

[output with "selamlar"]
 Hello World      HELLO WORLD      hello world      hElLo world 
"Selamlar World" "SELAMLAR WORLD" "selamlar world" "sElAmlar world" 

[output with "hi"]
Hello World HELLO WORLD hello world hElLo world 
"Hi World"  "HI WORLD"  "hi world"  "hI world" 
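Note that the code above reads the case pattern from the first word of each string, which works here because the target word comes first. A possible variation, sketched below with base R's `gregexpr` and `regmatches<-`, reads the pattern from each matched word instead. The helpers `copy_case` and `gsub_keep_case` are hypothetical names and not part of the answer.

```r
# Hypothetical helper: copy the upper/lower pattern of `src` onto `repl`.
# If `repl` is longer than `src`, the last flag of `src` is repeated.
copy_case <- function(src, repl) {
  flags <- strsplit(src, "")[[1]] %in% LETTERS
  chars <- strsplit(repl, "")[[1]]
  n <- length(chars)
  flags <- c(flags, rep(tail(flags, 1), n))[seq_len(n)]
  paste(ifelse(flags, toupper(chars), chars), collapse = "")
}

# Hypothetical helper: replace every case-insensitive match of `pattern`,
# taking the case pattern from the matched text itself.
gsub_keep_case <- function(x, pattern, repl) {
  m <- gregexpr(pattern, x, ignore.case = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    vapply(hits, copy_case, character(1), repl = repl, USE.NAMES = FALSE)
  })
  x
}

res <- gsub_keep_case(c("Hello World", "say HELLO to hElLo"), "hello", "hi")
```

This keeps the same padding rule as the answer (repeat the last flag when the replacement is longer), but it no longer depends on where the target word sits in the string.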

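Answer 1: (score: 0)

I tried to post this as a comment above, but it exceeded the length limit. Is it OK to start a new answer?

Here is the solution we are using. It takes the idea @vck proposed and wraps it in functions that clean up the input and output. It still feels a bit awkward to me, but the most important thing was to get something that works in a way we can understand. The gsubfn-based route was not.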

##' abbreviate words within strings, but preserve case of input
##'
##' Problem described at
##' http://stackoverflow.com/questions/32304688/gsub-replace-and-preserve-case
##' Please notify me of examples that fail
##' @param y character vector of strings containing the words to be abbreviated
##' @param old target words (regular expressions are allowed) to be replaced
##' @param new replacements for target words.  must match old
##' vector length.
##' @return vector of abbreviated words 
##' @author Paul Johnson <pauljohn@@ku.edu>
stabbr <- function(y = NULL, old = NULL, new = NULL){
    stopifnot(length(old) == length(new))
    transfwrap <- function(xxin, xxout, xx){
        sapply(xx, transf, xin = xxin, xout = xxout)
    }

    transf <- function(x, xin, xout) {
        xin <- tolower(xin)
        xcap <- (unlist(strsplit(unlist(strsplit(x," "))[1],"")) %in% LETTERS)
        n <- nchar(xout)
        if(length(xcap) >= n) {
            xcap<-xcap[1:n]
        } else {
            xcap <- c(xcap, rep(tail(xcap,1), n-length(xcap)))
        }
        xout2 <- paste(sapply(1:n,function(x) {
            if (xcap[x]) toupper(unlist(strsplit(xout,""))[x])
            else unlist(strsplit(xout,""))[x]
        }), sep = "", collapse = "")
        gsub(xin, xout2, x[1], ignore.case = TRUE)
    }

    for (i in seq_along(old)){
        y <- transfwrap(old[i], new[i], y)
    }
    y
}

Example usage:

x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")
xin <- c("Hello", "world")
xout <- c("hi", "wrld")
stabbr(x, xin, xout)

## Hello World HELLO WORLD hello world hElLo world 
##   "Hi Wrld"   "HI WRLD"   "hi wrld"   "hI wRLD" 
x <- c("Department of Ornithology", "DEPARTMENT of ORNITHOLOGY",
       "Dept of Ornith")
xin <- c("Department", "Ornithology")
xout <- c("Dept", "Orni")
res <- stabbr(x, xin, xout)
cbind(x, res)

##                           x                           res
## Department of Ornithology "Department of Ornithology" "Dept of Orni"
## DEPARTMENT of ORNITHOLOGY "DEPARTMENT of ORNITHOLOGY" "DEPT of ORNI"
## Dept of Ornith            "Dept of Ornith"            "Dept of Ornith"

## Tolerates regular expressions.
## Suppose you want to change Department only at first word?
x <- c("Department of Ornithology", "DEPARTMENT of ORNITHOLOGY",
       "Dept of Ornith", "Ornithology Department")
## Aiming here for Department only as first word
xin <- c("^Department", " Ornithology")
xout <- c("Dept", " Orni")
res <- stabbr(x, xin, xout)
res

This approach has a nice side effect: the output is a named vector that uses the input strings as names.

##    Department of Ornithology DEPARTMENT of ORNITHOLOGY  
##           "Dept of Orni"            "DEPT of ORNI" 
##
##           Dept of Ornith    Ornithology Department 
##          "Dept of Ornith"  "Ornithology Department"
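The names come from sapply, which labels its result with the elements of an unnamed character input. A small illustration of that behavior, using toupper as a stand-in for the transformation (an assumption made only to keep the example self-contained), along with one way to drop the names when a plain vector is wanted:

```r
x <- c("Hello World", "hello world")

# sapply() over an unnamed character vector uses the elements as names.
res <- sapply(x, toupper)
names(res)

# Drop the names when a plain character vector is wanted.
plain <- unname(res)
```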