A simple regular expression is confusing me

Time: 2012-12-24 04:58:15

Tags: regex r

I can't seem to extract the email address from the following phrase:

  

"mailto:fwwrp-3492801490@yahoo.com?"

So far I have tried:

regexpr(":([^\\?*]?)", phrase)

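The logic of the code is as follows:

  1. Start at the colon character :
  2. Take every character that is not a question mark
  3. Return those characters via the parentheses.

I'm not sure where my regex goes wrong.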

2 answers:

Answer 0 (score: 9)

Let's walk through your regex and see where it goes wrong. We'll pull it apart so it is easier to discuss:

:            Just a literal colon, no worries here.
(            Open a capture group.
    [        Open a character class, this will match one character.
        ^    The leading ^ means "negate this class"
        \\   This ends up as a single \ when the regex engine sees it and that will
             escape the next character.
        ?    This has no special meaning inside a character class, sometimes a
             question mark is just a question mark and this is one of those
             times. Escaping a simple character doesn't do anything interesting.
        *    Again, we're in a character class so * has no special meaning.
    ]        Close the character class.
    ?        Zero or one of the preceding pattern.
)            Close the capture group.

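Stripping out the noise leaves :([^?*]?)

So your regex actually matches:

  

A colon followed by zero or one character that is neither a question mark nor an asterisk; that character, if present, ends up in the first capture group.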
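To see this behaviour in R, here is a minimal sketch that runs the pattern from the question through regexpr and regmatches. The variable name phrase and its value come from the question; the names m and matched are just illustrative.

```r
phrase <- "mailto:fwwrp-3492801490@yahoo.com?"

# regexpr() returns the start position of the first match; the length of
# the match is stored in the "match.length" attribute.
m <- regexpr(":([^\\?*]?)", phrase)

# regmatches() extracts the matched text itself.
matched <- regmatches(phrase, m)
print(matched)
# [1] ":f"
```

The match is only two characters long: the colon plus the single character allowed by the trailing ? quantifier.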

That is quite different from what you are trying to do. A small adjustment should sort you out:

:([^?]*)

which matches:

  

A colon followed by any number of non-question-mark characters; those characters end up in the first capture group.
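For readers who want to try the adjusted pattern in R, the following is one possible sketch using regexec, which (unlike regexpr) also reports capture groups. The names phrase, parts and email are illustrative, and this is only one of several ways to get at the group.

```r
phrase <- "mailto:fwwrp-3492801490@yahoo.com?"

# regexec() records the position of the full match and of each capture group.
m <- regexec(":([^?]*)", phrase)

# regmatches() returns a list with one character vector per input string:
# element 1 is the full match, element 2 is the first capture group.
parts <- regmatches(phrase, m)[[1]]
email <- parts[2]
print(email)
# [1] "fwwrp-3492801490@yahoo.com"
```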

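The * is only special outside a character class, where it means "zero or more"; inside a character class it is just a literal *.

I'll leave it to others to help you with the R side of things; I just wanted you to understand what is going on with the regex.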
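The two roles of * can be compared side by side in R. This is a small sketch with made-up sample strings, not taken from the question:

```r
# Inside a character class, * is a literal asterisk: remove every asterisk.
no_stars <- gsub("[*]", "", "a*b**c")
print(no_stars)
# [1] "abc"

# Outside a character class, * is a quantifier: "b*" means zero or more b's.
x <- "abbbc"
run <- regmatches(x, regexpr("ab*", x))
print(run)
# [1] "abbb"
```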

Answer 1 (score: 3)

Here is a very simple approach using gsub:

gsub("([a-z]+:)(.*)([?]$)", "\\2", "mailto:fwwrp-3492801490@yahoo.com?")
## Or, if you expect things other than lowercase letters before the colon
gsub("(.*:)(.*)([?]$)", "\\2", "mailto:fwwrp-3492801490@yahoo.com?")
## Or, discarding the first and third groups since they aren't very useful
gsub(".*:(.*)[?]$", "\\1", "mailto:fwwrp-3492801490@yahoo.com?")

Building on what @TylerRinker started, you can also use strsplit, as follows (to sidestep the issues with gsub):

strsplit("mailto:fwwrp-3492801490@yahoo.com?", ":|\\?", fixed=FALSE)[[1]][2]
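It may help to look at the intermediate result before indexing. The sketch below splits the same string and keeps the whole vector of pieces; the names x and pieces are illustrative.

```r
x <- "mailto:fwwrp-3492801490@yahoo.com?"

# Split on either a colon or a literal question mark.
# strsplit() returns a list with one character vector per input string.
pieces <- strsplit(x, ":|\\?")[[1]]
print(pieces)
# [1] "mailto"                     "fwwrp-3492801490@yahoo.com"
```

A delimiter at the very end of the string does not produce a trailing empty piece, so the address is always the second element for inputs of this shape.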

What if you have a vector of such strings?

phrase <- c("mailto:fwwrp-3492801490@yahoo.com?", 
            "mailto:somefunk.y-address@Sqmpalm.net?")
phrase
# [1] "mailto:fwwrp-3492801490@yahoo.com?"  
# [2] "mailto:somefunk.y-address@Sqmpalm.net?"

## Using gsub
gsub("(.*:)(.*)([?]$)", "\\2", phrase)
# [1] "fwwrp-3492801490@yahoo.com"     "somefunk.y-address@Sqmpalm.net"

## Using strsplit
sapply(phrase, 
       function(x) strsplit(x, ":|\\?", fixed=FALSE)[[1]][2], 
       USE.NAMES=FALSE)
# [1] "fwwrp-3492801490@yahoo.com"     "somefunk.y-address@Sqmpalm.net"

I prefer the gsub approach for its brevity.
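As a side note, strsplit is itself vectorised over its input, so the per-element anonymous function is not strictly required. A possible variant, assuming every string has the same "prefix:address?" shape (the name emails is illustrative):

```r
phrase <- c("mailto:fwwrp-3492801490@yahoo.com?",
            "mailto:somefunk.y-address@Sqmpalm.net?")

# Split all strings at once, then take the second piece of each result.
pieces <- strsplit(phrase, ":|\\?")
emails <- vapply(pieces, function(p) p[2], character(1))
print(emails)
# [1] "fwwrp-3492801490@yahoo.com"     "somefunk.y-address@Sqmpalm.net"
```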