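First post on Stack Overflow, hopefully the first of many.
I am cleaning a dataset in which one column contains a list of authors. When there are multiple authors, they are separated by an ampersand, e.g. Smith & Banks. However, the spacing is not always consistent, e.g. Smith& Banks, Smith &Banks.
To fix this, I tried: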
gsub('\\S&','\\S &', dataset[,author.col])
This gives Smith& Banks -> SmitS & Banks. How can I get -> Smith & Banks instead?
Answer 0: (score: 3)
Here is another gsub approach:
# some test cases
authors <- c("Smith& Banks", "Smith &Banks", "Smith&Banks", "Smith & Banks")
gsub("\\s*&\\s*", " & ", authors)
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
More test cases (more than 2 authors, a single author):
authors <- c("Smith& Banks", "Smith &Banks &Nash", "Smith&Banks", "Smith & Banks", "Smith")
gsub("\\s*&\\s*", " & ", authors)
#[1] "Smith & Banks" "Smith & Banks & Nash" "Smith & Banks" "Smith & Banks" "Smith"
As the OP pointed out in a comment on their question, multiple ampersands between two authors do not occur in the data.
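To apply the same pattern to a data frame column rather than a bare vector, something along these lines might work. This is only a sketch: the helper `normalize_amp`, the data frame contents, and the column name `"author"` are hypothetical stand-ins for the OP's data.

```r
# Hypothetical helper: collapse any whitespace around "&" to one space per side.
normalize_amp <- function(x) gsub("\\s*&\\s*", " & ", x)

# Hypothetical data frame standing in for the OP's dataset.
dataset <- data.frame(
  author = c("Smith& Banks", "Smith &Banks &Nash", "Smith"),
  stringsAsFactors = FALSE
)
author.col <- "author"

# Overwrite the column with the normalized values.
dataset[, author.col] <- normalize_amp(dataset[, author.col])
dataset[, author.col]
#[1] "Smith & Banks" "Smith & Banks & Nash" "Smith"
```

Entries without an ampersand pass through unchanged, so the call can be run on the whole column without filtering first.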
Answer 1: (score: 2)
Here is a solution that makes two calls to gsub:
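# add a space before "&" when it directly follows a non-space character
dataset[,author.col] <- gsub('(\\S)&', '\\1 &', dataset[,author.col])
# add a space after "&" when it is directly followed by a non-space character
dataset[,author.col] <- gsub('&(\\S)', '& \\1', dataset[,author.col])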
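The two-pass idea can be checked on a small vector. The sketch below uses an assumed test vector `authors` and writes a literal space in the replacement, because `\\s` is a whitespace class only on the pattern side; in a replacement string it does not produce a space.

```r
# Assumed test vector, not part of the original answer.
authors <- c("Smith& Banks", "Smith &Banks", "Smith&Banks", "Smith & Banks")

# Pass 1: insert a space between a non-space character and a following "&".
step1 <- gsub("(\\S)&", "\\1 &", authors)
# Pass 2: insert a space between "&" and a following non-space character.
step2 <- gsub("&(\\S)", "& \\1", step1)
step2
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
```

Note that this only adds missing spaces; it does not collapse runs of several spaces around the ampersand.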
Answer 2: (score: 2)
Here is an option using just sub:
sub("\\b(?=&)|(?<=&)\\b", " ", v1, perl = TRUE)
#[1] "Smith & Banks" "Smith & Banks"
With data that has more combinations (above, I only considered the cases shown in the OP's post):
gsub("\\s*(?=&)|(?<=&)\\s*", " ", data, perl = TRUE)
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
gsub("\\s*&+|\\&+\\s*", " & ", data1)
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks"
#[4] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
Or with strsplit:
sapply(strsplit(data1, "\\s*&+\\s*"), paste, collapse = " & ")
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
#[5] "Smith & Banks" "Smith & Banks" "Smith & Banks"
Essentially, if there are many patterns, the strsplit approach would be better.
v1 <- c("Smith& Banks", "Smith &Banks")
data = c("Smith& Banks", "Smith &Banks", "Smith & Banks",
"Smith & Banks", "Smith&Banks")
data1 <- c(v1, "Smith&& Banks", "Smith && Banks", "Smith&&Banks")
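The strsplit approach above could be wrapped in a small function so it can be reused on any character vector. This is a sketch under assumptions: the name `rejoin_authors` and its `sep` argument are hypothetical, and the input vector is made up for illustration.

```r
# Hypothetical helper: split on one or more "&" with optional surrounding
# whitespace, then rejoin the pieces with a uniform separator.
rejoin_authors <- function(x, sep = " & ") {
  parts <- strsplit(x, "\\s*&+\\s*")
  # vapply guarantees exactly one string per input element.
  vapply(parts, paste, character(1), collapse = sep)
}

rejoin_authors(c("Smith& Banks", "Smith &Banks", "Smith&& Banks",
                 "Smith && Banks", "Smith"))
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith"
```

Because the split consumes the ampersands and the whitespace around them, every variant ends up with the same separator, and an entry with a single author is returned as-is.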
Answer 3: (score: 2)
Using stringi:
v <- c("Smith & Banks", "Smith& Banks", "Smith &Banks", "Smith&Banks", "Smith Banks")
library(stringi)
#create an index of entries containing "&"
indx <- grepl("&", v)
#subset "v" using that index
amp <- v[indx]
#perform the transformation on that subset and combine the result with the rest of "v"
c(sapply(stri_extract_all_words(amp),
function(x) { paste0(x, collapse = " & ") }), v[!indx])
Which gives:
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith Banks"
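Note that combining the two subsets with `c()` places the entries without "&" at the end, which matters if the result goes back into a data frame column. A possible variant that keeps the original order is sketched below; the vector `v` and the name `out` are assumptions made for illustration.

```r
library(stringi)

# Assumed input where the entry without "&" comes first.
v <- c("Smith Banks", "Smith& Banks", "Smith &Banks")

# Index of entries containing a literal "&".
indx <- grepl("&", v, fixed = TRUE)

# Write the transformed values back into their original positions.
out <- v
out[indx] <- vapply(stri_extract_all_words(v[indx]),
                    paste, character(1), collapse = " & ")
out
#[1] "Smith Banks" "Smith & Banks" "Smith & Banks"
```

Since `stri_extract_all_words` splits on every word boundary, a multi-word name such as "van der Berg" would also be joined with ampersands, so this variant suits single-word surnames only.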
Answer 4: (score: 0)
data = c("Smith& Banks", "Smith &Banks", "Smith & Banks",
"Smith & Banks", "Smith&Banks")
# Match zero or more spaces before and after the ampersand and replace the whole match with " & "
gsub("[ ]*&[ ]*", " & ", data)
# [1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
Answer 5: (score: 0)
Also try this:
gsub("([^& ]+)\\W+([^&]+)","\\1 & \\2",authors)
[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"