Question

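I want to clean up the retailer text so that each retailer name is followed by a dash.

I'm new to R and I'm teaching myself to program, so please bear with me. I know regex in general.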

mydata = read.csv("test4+.csv", header = TRUE)
mydata[,c("Store.Name")]

filenames <- c( "test4+.csv", "test4+.csv" )

for( f in filenames ){
  z <- readLines(f)
  a <- gsub("([S|s]potlight)\\s+(.*)", "\\1 - \\2", z)
  b <- gsub("([W|w]oolworths)\\s*(.*)", "\\1 - \\2", z)
  c <- gsub("([B|b]ig)(W)\\s*-*\\s*(.*)", "\\1 \\2 - \\3", z)

  cat(a, file=f, sep="\n")
  cat(b, file=f, sep="\n")
  cat(c, file=f, sep="\n")}


for( f in filenames ){ 
   cat(readLines(f), sep="\n")
}

where col1 should end up looking like col2:

col1                                     col2
woolworths abc                     woolworths - abc
woolworths bcd bce                 woolworths - bcd bce
spotlight blah blah (blah)         spotlight - blah blah (blah)
BigW act                           Big W - act
external                           external

Answer 1

You could try this:

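# Capture the text before and after the first run of whitespace and
# put " - " between them; perl = TRUE is needed for the lazy quantifier
a <- gsub("^(.*?)\\s+(.+)$", "\\1 - \\2", z, perl = TRUE)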

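and so on.

I'll be completely frank with you: I've never used regular expressions, or gsub, in R before, but this is the closest approximation I could come up with.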
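As for why the code in the question does not work, as far as I can tell there are three separate problems. Each gsub() call is applied to the original z, so the three fixes are never combined. Each cat(..., file=f) call overwrites the file, so only the last result (c) survives. And inside a character class `|` is a literal character, so `[S|s]` also matches a pipe; `[Ss]` is what was meant. Here is a minimal sketch that chains the substitutions and writes the file once. The temporary file and the sample lines are stand-ins for test4+.csv, not the real data.

```r
# Stand-in for test4+.csv: a temporary file holding the col1 values
# from the question (an assumption, not the real file contents)
f <- tempfile(fileext = ".csv")
writeLines(c("woolworths abc",
             "woolworths bcd bce",
             "spotlight blah blah (blah)",
             "BigW act",
             "external"), f)

z <- readLines(f)

# Chain the substitutions: each step works on the previous result
out <- sub("^([Ss]potlight)\\s+(.*)$", "\\1 - \\2", z)
out <- sub("^([Ww]oolworths)\\s+(.*)$", "\\1 - \\2", out)
out <- sub("^([Bb]ig)\\s*(W)\\s+(.*)$", "\\1 \\2 - \\3", out)

# Write the combined result back a single time
writeLines(out, f)

print(readLines(f))
```

Note that this sketch does not check for lines that already contain a dash, so running it twice on the same file would insert a second one.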

Answer 2

This should work:

sub('\\s+', ' - ', data$column) # replace only the first run of white space with " - "
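If you would rather work on the data frame column than on raw lines, something along these lines may be closer to what the question asks for. It is only a sketch: the data frame below is built by hand from the col1 values, the column name Store.Name is taken from the question's code, and the "other store" row is a made-up example to show that names outside the retailer list are left alone.

```r
# Hand-built stand-in for the CSV; "other store" is a hypothetical row
mydata <- data.frame(
  Store.Name = c("woolworths abc",
                 "woolworths bcd bce",
                 "spotlight blah blah (blah)",
                 "BigW act",
                 "external",
                 "other store"),
  stringsAsFactors = FALSE
)

# Only touch names that start with a known retailer; the backreference
# keeps whatever capitalisation the input used
fixed <- sub("^(woolworths|spotlight)\\s+(.+)$", "\\1 - \\2",
             mydata$Store.Name, ignore.case = TRUE)

# BigW needs its own rule because the name itself gets split
fixed <- sub("^(big)\\s*(w)\\s+(.+)$", "\\1 \\2 - \\3",
             fixed, ignore.case = TRUE)

mydata$Store.Name <- fixed
print(mydata$Store.Name)
```

Listing the retailers explicitly means a generic multi-word name is not split at its first space, which a blanket whitespace replacement would do.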

Code using regex in R is not working: why, and how can I fix it?
