Question

How can I split the following string?

"Wes Anderson – The Grand Budapest HotelRichard Linklater – BoyhoodBennett Miller – FoxcatcherMorten Tyldum – The Imitation Game"

into:

"Wes Anderson – The Grand Budapest Hotel"
"Richard Linklater – Boyhood"
"Bennett Miller – Foxcatcher"
"Morten Tyldum – The Imitation Game"

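The first split point is "HotelRichard", so I thought a pattern like [a-z][A-Z] (a lowercase letter immediately followed by an uppercase letter) could be used to find the split points. But if I split on that pattern, the matched characters themselves are removed: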

strsplit("HotelRichard", "[a-z][A-Z]") returns "Hote" "ichard".

Any good ideas?

Answer 1

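You can try this code. It uses a workaround: insert a § symbol at each split point (hopefully a character that does not otherwise occur in your input) and then split on it: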

x <- "Wes Anderson – The Grand Budapest HotelRichard Linklater – BoyhoodBennett Miller – FoxcatcherMorten Tyldum – The Imitation Game"
x <- gsub("([a-z])([A-Z])","\\1§\\2",x)
strsplit(x,"§")

Sample program output:

[[1]]
[1] "Wes Anderson \342\200\223 The Grand Budapest Hotel"
[2] "Richard Linklater \342\200\223 Boyhood"
[3] "Bennett Miller \342\200\223 Foxcatcher"
[4] "Morten Tyldum \342\200\223 The Imitation Game"
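The "hopefully not in your input" caveat can be checked explicitly. Below is a minimal sketch, not part of the original answer: `split_at_case_boundary` and its `delim` argument are hypothetical names, and stopping with an error when the delimiter is already present is just one possible choice. An ASCII hyphen stands in for the en dash so the example does not depend on the locale.

```r
# Hypothetical helper: split between a lowercase and an uppercase letter
# by first inserting a temporary delimiter, as in the answer above.
split_at_case_boundary <- function(x, delim = "\u00a7") {
  # Assumption: refusing to run is acceptable if the delimiter already
  # occurs in the input; picking a different delimiter would also work.
  if (any(grepl(delim, x, fixed = TRUE))) {
    stop("delimiter already present in input; choose another one")
  }
  # Insert the delimiter between the two captured letters.
  marked <- gsub("([a-z])([A-Z])", paste0("\\1", delim, "\\2"), x)
  # Split on the literal delimiter (fixed = TRUE avoids regex interpretation).
  strsplit(marked, delim, fixed = TRUE)
}

x <- "Wes Anderson - The Grand Budapest HotelRichard Linklater - Boyhood"
parts <- split_at_case_boundary(x)[[1]]
print(parts)
```

With this input, `parts` should hold the two director/film strings, and the § character never appears in the result.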

Answer 2

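First mark the boundaries in the director/film mash-up, then split the string on the inserted "xxx". The first step captures two groups (the lowercase letter and the uppercase letter that follows it) and puts three x's between them. The functions below come from the stringr package.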

library(stringr)
text <- "Wes Anderson – The Grand Budapest HotelRichard Linklater – BoyhoodBennett Miller – FoxcatcherMorten Tyldum – The Imitation Game"
text.split <- str_replace_all(text, "([a-z])([A-Z])", "\\1xxx\\2")
text.final <- str_split(text.split, "xxx")
text.final
[[1]]
[1] "Wes Anderson – The Grand Budapest Hotel" "Richard Linklater – Boyhood"            
[3] "Bennett Miller – Foxcatcher"             "Morten Tyldum – The Imitation Game"

Answer 3

Here is an approach that uses a single regular expression (lookbehind and lookahead):

strsplit(x, "(?<=[a-z])(?=[A-Z])", perl = TRUE)

## [[1]]
## [1] "Wes Anderson – The Grand Budapest Hotel"
## [2] "Richard Linklater – Boyhood"            
## [3] "Bennett Miller – Foxcatcher"            
## [4] "Morten Tyldum – The Imitation Game"
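To see why the lookarounds help, here is a small sketch contrasting the consuming pattern from the question with the zero-width version. The variable names `s`, `consumed` and `kept` are only for illustration.

```r
s <- "HotelRichard"

# Consuming pattern from the question: the matched "l" and "R" are part of
# the separator, so they are dropped from the result.
consumed <- strsplit(s, "[a-z][A-Z]")[[1]]

# Lookbehind + lookahead match only the position between "l" and "R",
# so no characters are lost. Lookarounds require perl = TRUE.
kept <- strsplit(s, "(?<=[a-z])(?=[A-Z])", perl = TRUE)[[1]]

print(consumed)  # "Hote"  "ichard"
print(kept)      # "Hotel" "Richard"
```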

R: split a string using a regex but keep the matched characters

3 answers: