gsub - adding a space character before/after &

Time: 2016-07-14 12:25:46

Tags: regex r gsub

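First post on stackoverflow, hopefully the first of many.

I am cleaning a dataset in which one column contains a list of authors. When there is more than one author, the authors are separated by an ampersand, e.g. Smith & Banks. However, the spacing is not always consistent, e.g. Smith& Banks, Smith &Banks.

To fix this, I tried: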

     gsub('\\S&','\\S &', dataset[,author.col])

This gave me Smith& Banks -> SmitS & Banks. How can I get -> Smith & Banks instead?

6 answers:

Answer 0: (score: 3)

Here is another gsub approach:

# some test cases
authors <- c("Smith& Banks", "Smith   &Banks", "Smith&Banks", "Smith & Banks")
gsub("\\s*&\\s*", " & ", authors)
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"

More test cases (more than two authors, a single author):

authors <- c("Smith& Banks", "Smith   &Banks &Nash", "Smith&Banks", "Smith & Banks", "Smith")
gsub("\\s*&\\s*", " & ", authors)
#[1] "Smith & Banks"        "Smith & Banks & Nash" "Smith & Banks"        "Smith & Banks"        "Smith"

As the OP pointed out in the comments on their question, multiple ampersands between two authors do not occur in the data.
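To tie this back to the question, here is a minimal sketch of applying the same pattern to a data frame column. The `dataset` contents and the column name `"author"` are made up for illustration; only the `dataset[, author.col]` indexing comes from the question.

```r
# Hypothetical data frame standing in for the OP's dataset
dataset <- data.frame(
  title  = c("A", "B", "C"),
  author = c("Smith& Banks", "Smith   &Banks &Nash", "Smith"),
  stringsAsFactors = FALSE
)
author.col <- "author"  # assumed column name

# Collapse any whitespace around "&" into exactly one space on each side
dataset[, author.col] <- gsub("\\s*&\\s*", " & ", dataset[, author.col])
dataset[, author.col]
#[1] "Smith & Banks"        "Smith & Banks & Nash" "Smith"
```

Rows without an ampersand are left unchanged because the pattern requires a literal `&` to match.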

Answer 1: (score: 2)

Here is a solution that makes two calls to gsub:

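# Capture the non-space character next to "&" and put it back with "\\1".
# The space is written literally: "\\s" in a replacement string is not a space.
dataset[,author.col] <- gsub('(\\S)&','\\1 &', dataset[,author.col])
dataset[,author.col] <- gsub('&(\\S)','& \\1', dataset[,author.col])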
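A small sketch of the two-call idea on a test vector, which may also help explain the result in the question: in the replacement string, `\\S` is not a character class, so the letter that was matched is dropped and a literal `S` appears instead. Capturing the character and re-inserting it with `\\1` avoids that. The variable names `broken`, `step1` and `fixed` are only for illustration.

```r
authors <- c("Smith& Banks", "Smith &Banks", "Smith&Banks", "Smith & Banks")

# The attempt from the question: the matched letter before "&" is lost
broken <- gsub("\\S&", "\\S &", authors)

# Capture the neighbouring non-space character and re-insert it
step1 <- gsub("(\\S)&", "\\1 &", authors)  # add a space before "&" where missing
fixed <- gsub("&(\\S)", "& \\1", step1)    # add a space after "&" where missing
fixed
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
```

Note that this only adds a missing space; runs of several spaces around `&` are not collapsed.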

Answer 2: (score: 2)

Here is an approach using only sub:

sub("\\b(?=&)|(?<=&)\\b", " ",  v1, perl = TRUE)
#[1] "Smith & Banks" "Smith & Banks"

Using data with more combinations. Above, I only considered the cases shown in the OP's post.

 gsub("\\s*(?=&)|(?<=&)\\s*", " ", data, perl = TRUE)
 #[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"

 gsub("\\s*&+|\\&+\\s*", " & ", data1)
 #[1] "Smith &  Banks" "Smith & Banks"  "Smith & Banks"  
 #[4] "Smith & Banks"  "Smith & Banks"  "Smith &  Banks" "Smith & Banks" 

Or with strsplit:

sapply(strsplit(data1, "\\s*&+\\s*"), paste, collapse = " & ")
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" 
#[5] "Smith & Banks" "Smith & Banks" "Smith & Banks"

Essentially, the strsplit approach works better when there are many different patterns.
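As a rough sketch, the split-and-rejoin idea can be wrapped in a small helper. The name `normalize_amp` is hypothetical, and `vapply` is used here in place of `sapply` only so the result is always a character vector; the test inputs are illustrative.

```r
# Hypothetical helper: split on one or more "&" with optional surrounding
# whitespace, then rejoin the pieces with a single " & "
normalize_amp <- function(x) {
  parts <- strsplit(x, "\\s*&+\\s*")
  vapply(parts, paste, character(1), collapse = " & ")
}

normalize_amp(c("Smith& Banks", "Smith &Banks", "Smith&& Banks",
                "Smith && Banks", "Smith&&Banks", "Smith &     Banks", "Smith"))
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
#[5] "Smith & Banks" "Smith & Banks" "Smith"
```

Because the separator is rebuilt from scratch, the spacing of the result does not depend on how many `&` or spaces were in the input.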

Data

v1 <- c("Smith& Banks", "Smith &Banks")
data = c("Smith& Banks", "Smith &Banks", "Smith & Banks", 
     "Smith &     Banks", "Smith&Banks")
data1 <- c(v1, "Smith&& Banks", "Smith && Banks", "Smith&&Banks")

Answer 3: (score: 2)

An overkill way using stringi:

v <- c("Smith & Banks", "Smith& Banks", "Smith &Banks", "Smith&Banks", "Smith Banks")

library(stringi)
#create an index of entries containing "&"
indx <- grepl("&", v)
#subset "v" using that index
amp  <- v[indx]
#perform the transformation on that subset and combine the result with the rest of "v"
c(sapply(stri_extract_all_words(amp), 
         function(x) { paste0(x, collapse = " & ") }), v[!indx])

which gives:

#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith Banks" 
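One thing to watch: `c(..., v[!indx])` puts the entries without `&` at the end, so the result may no longer line up with the original rows. A possible order-preserving variant is sketched below. It assigns back into the indexed positions and, to stay dependency-free, uses `regmatches()`/`gregexpr()` as a base-R stand-in for `stri_extract_all_words()`; that swap is an assumption of this sketch, not part of the answer.

```r
v <- c("Smith & Banks", "Smith Banks", "Smith& Banks", "Smith &Banks", "Smith&Banks")

# index of entries containing "&"
indx <- grepl("&", v)
# base-R stand-in for stri_extract_all_words():
# extract runs of characters that are neither "&" nor whitespace
words <- regmatches(v[indx], gregexpr("[^&[:space:]]+", v[indx]))
# assign back into the same positions so the original order is kept
v[indx] <- vapply(words, paste, character(1), collapse = " & ")
v
#[1] "Smith & Banks" "Smith Banks"   "Smith & Banks" "Smith & Banks" "Smith & Banks"
```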

Answer 4: (score: 0)

data = c("Smith& Banks", "Smith &Banks", "Smith & Banks", 
         "Smith &     Banks", "Smith&Banks")

# Match zero or more spaces before and after the ampersand and replace the whole match with " & "
gsub("[ ]*&[ ]*", " & ", data) 
# [1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
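A short sketch of how `[ ]*` differs from `\\s*`, in case the data might contain tabs or other whitespace next to the ampersand. The tab-separated input is invented for illustration, as are the names `spaces_only` and `any_ws`.

```r
x <- c("Smith &     Banks", "Smith\t&\tBanks")

# "[ ]*" only consumes literal spaces, so tabs next to "&" stay in place
spaces_only <- gsub("[ ]*&[ ]*", " & ", x)
# "\\s*" consumes any whitespace, including tabs
any_ws <- gsub("\\s*&\\s*", " & ", x)

spaces_only
#[1] "Smith & Banks"     "Smith\t & \tBanks"
any_ws
#[1] "Smith & Banks" "Smith & Banks"
```

If the column only ever contains plain spaces, the two patterns behave the same.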

Answer 5: (score: 0)

Also try this:

gsub("([^& ]+)\\W+([^&]+)","\\1 & \\2",authors)
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
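A caveat worth checking before using this on real data: `\\W+` matches any run of non-word characters, so a plain space between two words is enough for a match. The sketch below uses a made-up entry without an ampersand and compares the pattern with the `\\s*&\\s*` pattern from Answer 0; the names `with_W` and `with_s` are illustrative.

```r
authors <- c("Smith&Banks", "Smith Banks")

# "\\W+" also matches the plain space, so an "&" is inserted where none existed
with_W <- gsub("([^& ]+)\\W+([^&]+)", "\\1 & \\2", authors)
# requiring a literal "&" leaves the second entry untouched
with_s <- gsub("\\s*&\\s*", " & ", authors)

with_W
#[1] "Smith & Banks" "Smith & Banks"
with_s
#[1] "Smith & Banks" "Smith Banks"
```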