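Change word order in a long string using regular expressions

Time: 2017-11-15 15:15:27

Tags: r regex

I have a long string in which I want to change the order of some words. I want to use regular expressions because I have several elements to change and I would like to learn at the same time. Here is a sample of my strings: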

vec1 <- c("Internet-Devices Used to Access Internet Past 30 Days [Desktop Computer-Owned by Self]", 
     "Internet-Devices Used to Access Internet Past 30 Days [Tablet-Owned by Other HH Member]", 
     "Internet-Devices Used to Access Internet Past 30 Days [Laptop Computer-Made Available by Your Employer]",
     "Radio Stations-Listened to Past Week-Quebec City [FM-CFEL-102.1 (blvd 102.1)]")

vec1
[1] "Internet-Devices Used to Access Internet Past 30 Days [Desktop Computer-Owned by Self]"                 
[2] "Internet-Devices Used to Access Internet Past 30 Days [Tablet-Owned by Other HH Member]"                
[3] "Internet-Devices Used to Access Internet Past 30 Days [Laptop Computer-Made Available by Your Employer]"
[4] "Radio Stations-Listened to Past Week-Quebec City [FM-CFEL-102.1 (blvd 102.1)]"

I would like it to become:

[1] "Internet-Devices Used to Access Internet Past 30 Days -Owned by Self[Desktop Computer]"                 
[2] "Internet-Devices Used to Access Internet Past 30 Days -Owned by Other HH Member[Tablet]"                
[3] "Internet-Devices Used to Access Internet Past 30 Days -Made Available by Your Employer[Laptop Computer]"
[4] "Radio Stations-Listened to Past Week-Quebec City [FM-CFEL-102.1 (blvd 102.1)]"

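So I think the algorithm should work like this:

  1. Look for the part of the string after "Past 30 Days" and stop at the hyphen,

  2. Copy this extracted string to just before the last character of the main string,

  3. Remove the string extracted in step 1 from the main string (but not the copy you just added).

For step 1, I asked a similar question yesterday (Ignore part of a string when splitting using regular expression in R) and used it to find this regex, (?<=Past 30 Days ).+(?![^-]), which works on regex101.com but not in R (it doesn't stop at the hyphen):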

    reg1 <- regexec(pattern = "(?<=Past 30 Days ).+(?![^-])", vec1, perl=T)
    ext1 <- unname(mapply(function(xx,yy) substr(xx, yy, yy+attr(yy,"match.length")), vec1, reg1))
    ext1
    [1] "[Desktop Computer-Owned by Self]"                  "[Tablet-Owned by Other HH Member]"                
    [3] "[Laptop Computer-Made Available by Your Employer]" ""
    

As you can see, it doesn't stop at the hyphen.

For the second step, I came up with something like this:

    vec2 <- unname(mapply(gsub, ext1, vec1, MoreArgs = list(pattern="]")))
    vec2
    [1] "Internet-Devices Used to Access Internet Past 30 Days [Desktop Computer-Owned by Self[Desktop Computer-Owned by Self]"                                  
    [2] "Internet-Devices Used to Access Internet Past 30 Days [Tablet-Owned by Other HH Member[Tablet-Owned by Other HH Member]"                                
    [3] "Internet-Devices Used to Access Internet Past 30 Days [Laptop Computer-Made Available by Your Employer[Laptop Computer-Made Available by Your Employer]"
    [4] "Radio Stations-Listened to Past Week-Quebec City [FM-CFEL-102.1 (blvd 102.1)" 
    

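This is almost what I want, except that the "]" is removed from the last element of the vector and the wrong string is added (because of problem 1).

Finally, I remove the initial part of the string: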

    unname(mapply(gsub, paste0(stringr::str_sub(ext1, end=-2),"["), vec2, MoreArgs = list(replacement="[", fixed=T)))
    [1] "Internet-Devices Used to Access Internet Past 30 Days [Desktop Computer-Owned by Self]"                 
    [2] "Internet-Devices Used to Access Internet Past 30 Days [Tablet-Owned by Other HH Member]"                
    [3] "Internet-Devices Used to Access Internet Past 30 Days [Laptop Computer-Made Available by Your Employer]"
    [4] "Radio Stations-Listened to Past Week-Quebec City [FM-CFEL-102.1 (blvd 102.1)"
    

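This kind of works, but I have the same 2 problems as in step 2.

My whole code seems really heavy and complicated. Is there a better way to do this?

Notes:

  • I'm not looking for a super robust solution
  • I never have nested brackets
  • My strings always end with a bracket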

1 answer:

Answer 0: (score: 2)

You may use

(Past 30 Days\s*)([^-]*)([^]]+)

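and replace with \1\3\2. See the regex demo.

Details

  • (Past 30 Days\s*) - Group 1 (referred to with the \1 backreference from the replacement pattern):
    • Past 30 Days - a literal substring
    • \s* - zero or more whitespace characters
  • ([^-]*) - Group 2: zero or more characters other than -
  • ([^]]+) - Group 3: one or more characters other than ].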
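
To see what each group captures, here is a small sketch that applies the pattern to one of the sample strings with base R's regexec() and regmatches(). The variable names x, pattern and groups are illustrative and not part of the original answer.

```r
# Illustrative only: inspect the three capture groups on one sample string.
x <- "Internet-Devices Used to Access Internet Past 30 Days [Desktop Computer-Owned by Self]"
pattern <- "(Past 30 Days\\s*)([^-]*)([^]]+)"

# regexec() reports the full match plus one entry per capture group;
# regmatches() extracts the corresponding substrings.
groups <- regmatches(x, regexec(pattern, x))[[1]]

groups[2]  # group 1: "Past 30 Days "
groups[3]  # group 2: "[Desktop Computer"
groups[4]  # group 3: "-Owned by Self"
```

The closing ] is not part of any group, so it stays where it is, and the replacement \1\3\2 only swaps the two parts in front of it.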

See the R demo online:

vec1 <- c("Internet-Devices Used to Access Internet Past 30 Days [Desktop Computer-Owned by Self]", 
     "Internet-Devices Used to Access Internet Past 30 Days [Tablet-Owned by Other HH Member]", 
     "Internet-Devices Used to Access Internet Past 30 Days [Laptop Computer-Made Available by Your Employer]",
     "Radio Stations-Listened to Past Week-Quebec City [FM-CFEL-102.1 (blvd 102.1)]")
gsub("(Past 30 Days\\s*)([^-]*)([^]]+)", "\\1\\3\\2", vec1)
# [1] "Internet-Devices Used to Access Internet Past 30 Days -Owned by Self[Desktop Computer]"                 
# [2] "Internet-Devices Used to Access Internet Past 30 Days -Owned by Other HH Member[Tablet]"                
# [3] "Internet-Devices Used to Access Internet Past 30 Days -Made Available by Your Employer[Laptop Computer]"
# [4] "Radio Stations-Listened to Past Week-Quebec City [FM-CFEL-102.1 (blvd 102.1)]"
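
As for why the pattern in the question does not stop at the hyphen: a likely explanation is that .+ is greedy and runs to the end of the string, where the negative lookahead (?![^-]) succeeds simply because no character follows. The sketch below contrasts that with a negated character class, which stops at the first hyphen. The alternative pattern and the variable names are assumptions for illustration, not part of the original answer.

```r
# Illustrative only: compare the question's pattern with a negated character class.
x <- "Internet-Devices Used to Access Internet Past 30 Days [Desktop Computer-Owned by Self]"

# Question's pattern: greedy .+ consumes everything up to the end of the
# string, and (?![^-]) is satisfied there because nothing follows.
greedy <- regmatches(x, regexpr("(?<=Past 30 Days ).+(?![^-])", x, perl = TRUE))
greedy   # "[Desktop Computer-Owned by Self]"

# Assumed alternative: [^-]+ cannot cross a hyphen, so the match ends
# right before the first "-" after the opening bracket.
stopped <- regmatches(x, regexpr("(?<=Past 30 Days \\[)[^-]+", x, perl = TRUE))
stopped  # "Desktop Computer"
```

Lookbehind needs perl = TRUE in R, whereas the gsub() call above relies only on capture groups and works with the default regex engine.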