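Question

"CATARACT; #&#22823;&#33151;&#39592;~2010"

I need to use gsub in R to pick out &#22823;&#33151;&#39592;, which is actually a run of Unicode character references, each starting with &# followed by five digits and ending with ;

I know how to get rid of these references using the following:

gsub("&#[0-9]+;","","CATARACT; #&#22823;&#33151;&#39592;~2010")

But how can I use gsub to keep them instead?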

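Edit 01

My desired output is &#22823;&#33151;&#39592;.

Edit 02

Thanks for the answers, but the pattern is not always like this. I need to pick up the Unicode references wherever they appear:

"CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"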

Answer 1

For example, using gregexpr and regmatches:

ex <- "CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"
m <- gregexpr("&#[0-9]+;", ex)
(r <- regmatches(ex, m))
# [[1]]
# [1] "&#22823;" "&#33151;" "&#39592;" "&#22824;" "&#33152;" "&#39593;"

paste(r[[1]], collapse="")
# [1] "&#22823;&#33151;&#39592;&#22824;&#33152;&#39593;"
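
If several strings need the same treatment, the two steps above can be wrapped in a small helper. This is only a sketch: the name extract_entities is made up for illustration and is not part of base R.

```r
# Hypothetical helper: keep only the "&#digits;" references in each string.
extract_entities <- function(x) {
  m <- gregexpr("&#[0-9]+;", x)
  # regmatches() returns one character vector per input string;
  # paste the pieces of each back together (no match gives "").
  vapply(regmatches(x, m), paste, character(1), collapse = "")
}

inputs <- c(
  "CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010",
  "no references here"
)
result <- extract_entities(inputs)
result
# [1] "&#22823;&#33151;&#39592;&#22824;&#33152;&#39593;" ""
```

Strings without any reference come back as empty strings, so the result always has the same length as the input.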

Answer 2

You can try:

x <- "CATARACT; #&#22823;&#33151;&#39592;~2010"
gsub("(^\\D*)((&#[0-9]+;)+)(.*$)", "\\2", x)
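
The pattern above keeps only the first run of references, so it does not cover the case from Edit 02 where references appear in several places. One possible gsub-only variant, offered as a sketch rather than the answerer's method, uses an alternation: either capture a reference, or match any other single character, and keep only the captured group. The variable names ex and kept are illustrative.

```r
ex <- "CATARACT; #&#22823;&#33151;&#39592;~2010;CATARACT; #&#22824;&#33152;&#39593;~2010"

# Left branch: a reference, captured in group 1.
# Right branch: any other single character, with group 1 left empty.
# Replacing every match with "\\1" keeps references and drops the rest.
kept <- gsub("(&#[0-9]+;)|.", "\\1", ex, perl = TRUE)
kept
# [1] "&#22823;&#33151;&#39592;&#22824;&#33152;&#39593;"
```

Because the left branch is tried first at every position, a complete reference is always kept whole, while a stray & or # that is not part of a reference falls through to the right branch and is removed.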
