R: split a string with a regex but keep the matched characters

Date: 2015-04-10 20:28:17

Tags: regex r

How can I split the following string?

  "Wes Anderson – The Grand Budapest HotelRichard Linklater – BoyhoodBennett Miller – FoxcatcherMorten Tyldum – The Imitation Game"

into:

"Wes Anderson – The Grand Budapest Hotel"
"Richard Linklater – Boyhood"
"Bennett Miller – Foxcatcher"
"Morten Tyldum – The Imitation Game"

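The first split point is "HotelRichard", so I thought the pattern [a-z][A-Z] (a lowercase letter directly followed by an uppercase letter) could be used to find the boundaries. But when I split on that pattern, the matched characters are removed: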

strsplit("HotelRichard", "[a-z][A-Z]") returns "Hote" "ichard".

Any good ideas?

3 answers:

Answer 0: (score: 3)

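You can try this code. It uses a workaround: insert a § symbol at each boundary (hopefully a character that does not occur in your input) and then split on it: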

x <- "Wes Anderson – The Grand Budapest HotelRichard Linklater – BoyhoodBennett Miller – FoxcatcherMorten Tyldum – The Imitation Game"
x <- gsub("([a-z])([A-Z])","\\1§\\2",x)
strsplit(x,"§")

Sample program output:

[[1]]
[1] "Wes Anderson \342\200\223 The Grand Budapest Hotel"
[2] "Richard Linklater \342\200\223 Boyhood"
[3] "Bennett Miller \342\200\223 Foxcatcher"
[4] "Morten Tyldum \342\200\223 The Imitation Game"
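The workaround relies on the placeholder never appearing in the input. A minimal sketch of a guarded version follows; the function name `split_on_case_boundary` and its `placeholder` argument are illustrative, not part of any package, and the input is shortened to two titles with the en dash written as `\u2013`.

```r
# Hypothetical helper (name and arguments are assumptions for illustration).
split_on_case_boundary <- function(s, placeholder = "\u00a7") {
  # Stop if the placeholder already occurs in the input: the final split
  # could not tell an original § apart from an inserted one.
  if (any(grepl(placeholder, s, fixed = TRUE))) {
    stop("placeholder already occurs in the input; choose another one")
  }
  # Insert the placeholder between a lowercase letter and the uppercase
  # letter that follows it, keeping both letters via the capture groups.
  marked <- gsub("([a-z])([A-Z])", paste0("\\1", placeholder, "\\2"), s)
  # Split on the placeholder as a literal string, not as a regex.
  strsplit(marked, placeholder, fixed = TRUE)
}

# Shortened input; "\u2013" is the en dash used in the question.
x <- "Wes Anderson \u2013 The Grand Budapest HotelRichard Linklater \u2013 Boyhood"
split_on_case_boundary(x)[[1]]
```

The placeholder is pasted into the replacement string, so it should not contain a backslash; a single unusual character is the simplest choice.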

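Answer 1: (score: 0)

First mark the boundaries in the director/film mash-up, then split the string on the inserted "xxx". The first step captures two groups (the lowercase letter and the uppercase letter after it) and puts three x's between them. The functions come from the stringr package.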

library(stringr)  # provides str_replace_all() and str_split()
text <- "Wes Anderson – The Grand Budapest HotelRichard Linklater – BoyhoodBennett Miller – FoxcatcherMorten Tyldum – The Imitation Game"
text.split <- str_replace_all(text, "([a-z])([A-Z])", "\\1xxx\\2")
text.final <- str_split(text.split, "xxx")
text.final
[[1]]
[1] "Wes Anderson – The Grand Budapest Hotel" "Richard Linklater – Boyhood"            
[3] "Bennett Miller – Foxcatcher"             "Morten Tyldum – The Imitation Game"

Answer 2: (score: 0)

Here is an approach using a single regular expression (lookahead and lookbehind):

strsplit(x, "(?<=[a-z])(?=[A-Z])", perl = TRUE)

## [[1]]
## [1] "Wes Anderson – The Grand Budapest Hotel"
## [2] "Richard Linklater – Boyhood"            
## [3] "Bennett Miller – Foxcatcher"            
## [4] "Morten Tyldum – The Imitation Game"
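
To see why this keeps the letters, it may help to compare the lookaround pattern with the consuming pattern from the question on the small "HotelRichard" case. The sketch below wraps the call in a function; the name `split_keep_boundary` is illustrative only.

```r
# Hypothetical wrapper around the lookaround split (name is an assumption).
split_keep_boundary <- function(s) {
  # (?<=[a-z]) asserts a lowercase letter just before the split point,
  # (?=[A-Z]) asserts an uppercase letter just after it.
  # Both assertions are zero-width, so no character is consumed.
  strsplit(s, "(?<=[a-z])(?=[A-Z])", perl = TRUE)
}

split_keep_boundary("HotelRichard")[[1]]
# [1] "Hotel"   "Richard"

# The pattern from the question matches "lR" and drops it as the delimiter:
strsplit("HotelRichard", "[a-z][A-Z]")[[1]]
# [1] "Hote"   "ichard"

# Caveat: every lowercase-then-uppercase pair is a split point,
# so a name such as "McDonald" is split as well.
split_keep_boundary("McDonald")[[1]]
```

Note that `[a-z]` and `[A-Z]` cover ASCII letters only, so a boundary next to an accented letter would not be detected by this pattern.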