Question

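First post on Stack Overflow, hopefully the first of many.

I'm cleaning a dataset in which one column contains lists of authors. When there are multiple authors, they are separated by an ampersand, e.g. Smith & Banks. However, the spacing is not always consistent, e.g. Smith& Banks, Smith &Banks.

To fix this, I tried: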

     gsub('\\S&','\\S &', dataset[,author.col])

This gives Smith& Banks -> SmitS & Banks. How can I get -> Smith & Banks instead?

Answer 1

Here is another gsub approach:

# some test cases
authors <- c("Smith& Banks", "Smith   &Banks", "Smith&Banks", "Smith & Banks")
gsub("\\s*&\\s*", " & ", authors)
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"

More test cases (more than two authors, a single author):

authors <- c("Smith& Banks", "Smith   &Banks &Nash", "Smith&Banks", "Smith & Banks", "Smith")
gsub("\\s*&\\s*", " & ", authors)
#[1] "Smith & Banks"        "Smith & Banks & Nash" "Smith & Banks"        "Smith & Banks"        "Smith"

As the OP pointed out in the comments on the question, multiple ampersands between two authors do not occur in the data.
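To apply the same replacement to a data frame column, as in the question, it can be wrapped in a small helper. This is only a sketch: the function name `normalize_amp`, the example `dataset`, and the column name `author` are made up for illustration and are not from the OP's data.

```r
# Hypothetical helper: normalise the spacing around "&" in a character vector
normalize_amp <- function(x) {
  gsub("\\s*&\\s*", " & ", x)
}

# Hypothetical data frame standing in for the OP's dataset
dataset <- data.frame(
  author = c("Smith& Banks", "Smith &Banks", "Smith&Banks", "Smith"),
  stringsAsFactors = FALSE
)
author.col <- "author"

# Overwrite the column with the cleaned values
dataset[, author.col] <- normalize_amp(dataset[, author.col])
dataset[, author.col]
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith"
```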

Answer 2

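Here is a solution that makes two calls to gsub. Each call captures the non-whitespace character next to the ampersand and puts it back with `\\1`, adding a literal space:

# add a space before "&" when the preceding character is not whitespace
dataset[,author.col] <- gsub('(\\S)&','\\1 &', dataset[,author.col])
# add a space after "&" when the following character is not whitespace
dataset[,author.col] <- gsub('&(\\S)','& \\1', dataset[,author.col])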
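As a side note on the attempt in the question: in the replacement string, `\\S` is not a character class, so a literal `S` is inserted, which is where `SmitS` comes from. Only backreferences such as `\\1` are substituted. A minimal sketch of the two-call idea with capture groups, run on a made-up test vector `x`:

```r
# Hypothetical test vector covering the spacing variants from the question
x <- c("Smith& Banks", "Smith &Banks", "Smith&Banks", "Smith & Banks")

# Step 1: a non-whitespace character directly before "&" gets a space after it
x <- gsub("(\\S)&", "\\1 &", x)
# Step 2: a non-whitespace character directly after "&" gets a space before it
x <- gsub("&(\\S)", "& \\1", x)
x
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
```

Unlike a `\\s*&\\s*` pattern, this only adds missing spaces; runs of several spaces around the ampersand are left as they are.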

Answer 3

Here is an approach using only sub:

sub("\\b(?=&)|(?<=&)\\b", " ",  v1, perl = TRUE)
#[1] "Smith & Banks" "Smith & Banks"

With data that contains more combinations (above, I only considered the cases shown in the OP's post):

 gsub("\\s*(?=&)|(?<=&)\\s*", " ", data, perl = TRUE)
 #[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"

 gsub("\\s*&+|\\&+\\s*", " & ", data1)
 #[1] "Smith &  Banks" "Smith & Banks"  "Smith & Banks"
 #[4] "Smith & Banks"  "Smith & Banks"  "Smith &  Banks" "Smith & Banks"

Or with strsplit:

sapply(strsplit(data1, "\\s*&+\\s*"), paste, collapse = " & ")
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" 
#[5] "Smith & Banks" "Smith & Banks" "Smith & Banks"

Essentially, if there are many patterns, the strsplit approach works better.
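For comparison, the split pattern from the strsplit line can also be passed to a single gsub call, which then collapses repeated ampersands as well. This is a sketch on a made-up vector `x`; it is not part of the original answer.

```r
# Hypothetical test vector with single and repeated ampersands
x <- c("Smith& Banks", "Smith &Banks", "Smith&& Banks",
       "Smith && Banks", "Smith&&Banks")

# One or more "&" plus any surrounding whitespace becomes a single " & "
out <- gsub("\\s*&+\\s*", " & ", x)
out
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
#[5] "Smith & Banks"
```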

Data

v1 <- c("Smith& Banks", "Smith &Banks")
data = c("Smith& Banks", "Smith &Banks", "Smith & Banks", 
     "Smith &     Banks", "Smith&Banks")
data1 <- c(v1, "Smith&& Banks", "Smith && Banks", "Smith&&Banks")

Answer 4

An overkill way using stringi:

v <- c("Smith & Banks", "Smith& Banks", "Smith &Banks", "Smith&Banks", "Smith Banks")

library(stringi)
#create an index of entries containing "&"
indx <- grepl("&", v)
#subset "v" using that index
amp  <- v[indx]
#perform the transformation on that subset and combine the result with the rest of "v"
c(sapply(stri_extract_all_words(amp), 
         function(x) { paste0(x, collapse = " & ") }), v[!indx])

Which gives:

#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith Banks"
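Note that `c(..., v[!indx])` places the entries containing "&" first, so the original order is only kept when the entries without "&" sit at the end of the vector. A sketch that writes the result back into the indexed positions instead is shown here. It swaps stringi for base R (`regmatches` with `gregexpr`) so that it runs without extra packages, and it assumes that names consist of alphanumeric characters only.

```r
# Hypothetical vector where the entry without "&" comes first
v <- c("Smith Banks", "Smith& Banks", "Smith &Banks")

# Index of entries containing "&"
indx <- grepl("&", v, fixed = TRUE)

# Extract runs of alphanumeric characters (assumption: names contain nothing else)
words <- regmatches(v[indx], gregexpr("[[:alnum:]]+", v[indx]))

# Write the rebuilt strings back in place, so the order of "v" is preserved
v[indx] <- vapply(words, paste, character(1), collapse = " & ")
v
#[1] "Smith Banks"   "Smith & Banks" "Smith & Banks"
```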

Answer 5

data = c("Smith& Banks", "Smith &Banks", "Smith & Banks", 
         "Smith &     Banks", "Smith&Banks")

# Match zero or more spaces before and after the ampersand and replace the whole match with " & "
gsub("[ ]*&[ ]*", " & ", data) 
# [1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"

Answer 6

Also try this:

gsub("([^& ]+)\\W+([^&]+)","\\1 & \\2",authors)
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"

gsub - add space character before/after &
