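How to exclude images from a regex email extraction

Time: 2016-06-18 02:00:01

Tags: regex regex-negation

I'm using some email-extraction software (surprise) to pull email addresses from websites. It uses this regular expression: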

[A-Z0-9._%+-]+@[A-Z0-9.-]{3,65}\.[A-Z]{2,4}

But this returns image filenames as well as email addresses, for example _212000482_1@80xauto.jpg
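The behaviour can be reproduced outside the extraction tool with a few lines of Python. This is only a sketch: the tool's regex engine and flags are unknown, so the case-insensitive flag is an assumption (the character classes list only uppercase letters), and the sample string is made up from the filename above plus two ordinary addresses.

```python
import re

# Pattern copied from the question. re.IGNORECASE is an assumption:
# the classes only contain A-Z, so lowercase input needs the flag.
ORIGINAL = re.compile(
    r"[A-Z0-9._%+-]+@[A-Z0-9.-]{3,65}\.[A-Z]{2,4}",
    re.IGNORECASE,
)

# Hypothetical page text: two addresses and one image filename.
text = "droids@gmail.com _212000482_1@80xauto.jpg More.Droids@deathstar.com"

matches = ORIGINAL.findall(text)
print(matches)  # the image filename appears alongside the real addresses
```

The filename matches because "jpg" satisfies the final `[A-Z]{2,4}` just as "com" does; nothing in the pattern distinguishes a file extension from a top-level domain.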

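I can change this regex, but I can't figure out how to exclude matches that end in .png, .jpg, and so on.

There is plenty of information about validating email addresses, and about how hard that is, but all I want to do is exclude images from the result list.

1 answer:

Answer 0: (score: 1)

Description

In the sample text, the unwanted substring looks like an email address but conveniently ends in jpg. A negative lookahead placed at the start of the pattern can therefore exclude matches by file extension: it scans to the end of the current run of non-whitespace characters and rejects the position if that run ends in one of the listed image extensions. Note that the character classes list only uppercase letters, so the pattern needs the case-insensitive flag to match the sample text.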

(?!\S*\.(?:jpg|png|gif|bmp)(?:[\s\n\r]|$))[A-Z0-9._%+-]+@[A-Z0-9.-]{3,65}\.[A-Z]{2,4}

Example

Live demo

https://regex101.com/r/mU7bO3/2

Sample text

droids@gmail.com _212000482_1@80xauto.jpg More.Droids@deathstar.com

Sample matches

droids@gmail.com 
More.Droids@deathstar.com
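
For anyone who wants to check the result without the live demo, here is a minimal Python sketch of the same test. The pattern is the one from the answer; the `re.IGNORECASE` flag and the variable names are assumptions, not something the extraction software is known to use.

```python
import re

# Pattern from the answer, split across two string literals for readability.
# re.IGNORECASE is assumed because the classes only list uppercase letters.
PATTERN = re.compile(
    r"(?!\S*\.(?:jpg|png|gif|bmp)(?:[\s\n\r]|$))"
    r"[A-Z0-9._%+-]+@[A-Z0-9.-]{3,65}\.[A-Z]{2,4}",
    re.IGNORECASE,
)

sample = "droids@gmail.com _212000482_1@80xauto.jpg More.Droids@deathstar.com"

emails = PATTERN.findall(sample)
print(emails)  # ['droids@gmail.com', 'More.Droids@deathstar.com']
```

One limitation to keep in mind: the lookahead only rejects a token when the image extension is followed by whitespace or the end of the input. A filename followed directly by a quote or angle bracket, as in an HTML attribute, can still slip through, so the trailing `(?:[\s\n\r]|$)` may need adjusting for raw HTML.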

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
    \S*                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (0 or more times (matching the
                             most amount possible))
----------------------------------------------------------------------
    \.                       '.'
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      jpg                      'jpg'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      png                      'png'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      gif                      'gif'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      bmp                      'bmp'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      [\s\n\r]                 any character of: whitespace (\n, \r,
                               \t, \f, and " "), '\n' (newline), '\r'
                               (carriage return)
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      $                        before an optional \n, and the end of
                               a "line"
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  [A-Z0-9._%+-]+           any character of: 'A' to 'Z', '0' to '9',
                           '.', '_', '%', '+', '-' (1 or more times
                           (matching the most amount possible))
----------------------------------------------------------------------
  @                        '@'
----------------------------------------------------------------------
  [A-Z0-9.-]{3,65}         any character of: 'A' to 'Z', '0' to '9',
                           '.', '-' (between 3 and 65 times (matching
                           the most amount possible))
----------------------------------------------------------------------
  \.                       '.'
----------------------------------------------------------------------
  [A-Z]{2,4}               any character of: 'A' to 'Z' (between 2
                           and 4 times (matching the most amount
                           possible))
----------------------------------------------------------------------
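
The breakdown above can also be kept next to the pattern itself. The following is a sketch using Python's `re.VERBOSE` mode, which ignores whitespace and allows comments inside the pattern. It makes one simplification that is not in the original: `[\s\n\r]` is written as `\s`, which already covers `\n` and `\r`. The flag choice and names are assumptions.

```python
import re

VERBOSE_PATTERN = re.compile(
    r"""
    (?!                          # negative lookahead: reject this position if
        \S*                      #   the rest of the non-whitespace run
        \.(?:jpg|png|gif|bmp)    #   ends in one of these image extensions
        (?:\s|$)                 #   followed by whitespace or end of input
    )
    [A-Z0-9._%+-]+               # part before the @
    @
    [A-Z0-9.-]{3,65}             # domain, 3 to 65 characters
    \.
    [A-Z]{2,4}                   # final label, 2 to 4 letters
    """,
    re.IGNORECASE | re.VERBOSE,
)

sample = "droids@gmail.com _212000482_1@80xauto.jpg More.Droids@deathstar.com"
found = VERBOSE_PATTERN.findall(sample)
print(found)  # ['droids@gmail.com', 'More.Droids@deathstar.com']
```

Adding another extension to exclude is then a one-word change to the alternation on the third line of the lookahead.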