Question

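I am using tidyr::unite to merge many columns, with a semicolon as the separator. I changed all the NAs to empty strings (''). When I run the unite command I get what I want, but many cells contain repeated semicolons left over from the empty cells. Here is an example of my strings.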

string <- c('community centre;;sports hall;;;','community centre;;;;;')

After finding this SO question on a similar topic, I came up with this regex. But it also cuts off the letter just before each run of semicolons.

gsub('([[:alpha:]])\\;+', '\\;', string)

[1] "community centr;sports hal;"
[2] "community centr;"

Having got this far, I am stuck. I would like a regex that gives me this output.

[1] "community centre; sports hall"
[2] "community centre"

Thanks.

Answer 1

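For simplicity, I'd suggest doing this in two steps. First replace each run of ";" with "; ", then remove the "; " from the end of the string. A single regex might be more efficient, but it would be less straightforward.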

string = gsub(";+", "; ", string)
string = gsub("; $", "", string)
string
# [1] "community centre; sports hall" "community centre"
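
To see why the second step is needed, here is a small sketch that prints the intermediate result. The names step1 and step2 are only illustrative and are not part of the answer above.

```r
string <- c('community centre;;sports hall;;;', 'community centre;;;;;')

# Step 1: every run of one or more semicolons becomes "; "
step1 <- gsub(";+", "; ", string)
step1
# [1] "community centre; sports hall; " "community centre; "

# Step 2: remove the "; " that step 1 leaves at the end of each string
step2 <- gsub("; $", "", step1)
step2
# [1] "community centre; sports hall" "community centre"
```

The trailing run of semicolons is turned into "; " like any other run, so it has to be stripped separately.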

Answer 2

We can use:

stringr::str_remove_all(string,";(?=\\W+)|;$")
[1] "community centre;sports hall" "community centre"
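
If stringr is not available, the same pattern should also work in base R, as sketched below. The lookahead requires perl = TRUE. The last step, which adds a space after each remaining semicolon, is an assumption about the format the question asks for and is not part of this answer.

```r
string <- c('community centre;;sports hall;;;', 'community centre;;;;;')

# Same pattern in base R; perl = TRUE enables the lookahead (?=...)
cleaned <- gsub(";(?=\\W+)|;$", "", string, perl = TRUE)
cleaned
# [1] "community centre;sports hall" "community centre"

# Optional extra step (assumption): put a space after each remaining semicolon
spaced <- gsub(";", "; ", cleaned, fixed = TRUE)
spaced
# [1] "community centre; sports hall" "community centre"
```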

Answer 3

You can use a single regex for the job:

gsub("^;+|;+$|(;)+", "\\1", string)

Or, if you prefer stringr:

stringr::str_replace_all(string, "^;+|;+$|(;)+", "\\1")

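It matches one or more semicolons at the start of the string (^;+), or one or more semicolons at the end of the string (;+$), or one or more semicolons anywhere else ((;)+), capturing a single ; in group 1.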

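The replacement is just the contents of group 1: an empty string if either of the first two alternatives matched, or ; if the third alternative matched.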
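
A minimal sketch of this pattern in action follows. Two things here are assumptions and not part of the answer: perl = TRUE is added so that the alternatives are tried strictly left to right, and the third input string (with leading semicolons) is a made-up example to exercise the ^;+ alternative. The final step that adds a space after each semicolon is likewise an assumption about the output format the question asks for.

```r
# The third element is hypothetical, added to show the leading-semicolon case
string <- c('community centre;;sports hall;;;',
            'community centre;;;;;',
            ';;library;;pool')

# Leading and trailing runs are replaced by the (unset, hence empty) group 1;
# runs in the middle are replaced by the single ";" captured in group 1
collapsed <- gsub("^;+|;+$|(;)+", "\\1", string, perl = TRUE)
collapsed
# [1] "community centre;sports hall" "community centre" "library;pool"

# Optional extra step (assumption): put a space after each remaining semicolon
gsub(";", "; ", collapsed, fixed = TRUE)
# [1] "community centre; sports hall" "community centre" "library; pool"
```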