I'm using some email-extraction software (surprise) to pull email addresses from websites. It uses this regular expression:
[A-Z0-9._%+-]+@[A-Z0-9.-]{3,65}\.[A-Z]{2,4}
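But this returns image file names as well as email addresses, for example _212000482_1@80xauto.jpg
I can change the regular expression, but I can't figure out how to exclude matches that end in .png, .jpg, and so on.
There is plenty of information about validating email addresses, and about how hard that is, but all I want to do is exclude images from the list of results.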
Answer 0 (score: 1)
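In the sample text, the unwanted substring looks like an email address but conveniently ends in jpg. So a negative lookahead placed in front of the original pattern can reject any whitespace-delimited token that ends in an image file extension. The character classes list only upper-case letters, so the pattern has to be run with the case-insensitive flag to match the lower-case sample text.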
(?!\S*\.(?:jpg|png|gif|bmp)(?:[\s\n\r]|$))[A-Z0-9._%+-]+@[A-Z0-9.-]{3,65}\.[A-Z]{2,4}
Live demo
https://regex101.com/r/mU7bO3/2
Sample text
droids@gmail.com _212000482_1@80xauto.jpg More.Droids@deathstar.com
Sample matches
droids@gmail.com
More.Droids@deathstar.com
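
To check the matches above outside regex101, here is a minimal sketch in Python. The pattern is copied from this answer unchanged; the choice of Python, the re.IGNORECASE flag, and the names EMAIL_NOT_IMAGE and extract_emails are assumptions made for illustration, not part of the original software.

```python
import re

# Pattern copied from the answer. IGNORECASE is an assumption: the character
# classes only list A-Z, so without it the lower-case sample would not match.
EMAIL_NOT_IMAGE = re.compile(
    r"(?!\S*\.(?:jpg|png|gif|bmp)(?:[\s\n\r]|$))"
    r"[A-Z0-9._%+-]+@[A-Z0-9.-]{3,65}\.[A-Z]{2,4}",
    re.IGNORECASE,
)


def extract_emails(text):
    """Return every match of the answer's pattern in text, in order."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return EMAIL_NOT_IMAGE.findall(text)


sample = "droids@gmail.com _212000482_1@80xauto.jpg More.Droids@deathstar.com"
print(extract_emails(sample))
# ['droids@gmail.com', 'More.Droids@deathstar.com']
```

One thing this sketch makes easy to see: the lookahead only rejects an extension that is followed by whitespace or the end of the input, so an image name followed directly by punctuation, as in "see _1@80xauto.jpg, ok", would still be returned.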
NODE                     EXPLANATION
----------------------------------------------------------------------
(?!                      look ahead to see if there is not:
----------------------------------------------------------------------
  \S*                    non-whitespace (all but \n, \r, \t, \f,
                         and " ") (0 or more times (matching the
                         most amount possible))
----------------------------------------------------------------------
  \.                     '.'
----------------------------------------------------------------------
  (?:                    group, but do not capture:
----------------------------------------------------------------------
    jpg                  'jpg'
----------------------------------------------------------------------
   |                     OR
----------------------------------------------------------------------
    png                  'png'
----------------------------------------------------------------------
   |                     OR
----------------------------------------------------------------------
    gif                  'gif'
----------------------------------------------------------------------
   |                     OR
----------------------------------------------------------------------
    bmp                  'bmp'
----------------------------------------------------------------------
  )                      end of grouping
----------------------------------------------------------------------
  (?:                    group, but do not capture:
----------------------------------------------------------------------
    [\s\n\r]             any character of: whitespace (\n, \r,
                         \t, \f, and " "), '\n' (newline), '\r'
                         (carriage return)
----------------------------------------------------------------------
   |                     OR
----------------------------------------------------------------------
    $                    before an optional \n, and the end of
                         a "line"
----------------------------------------------------------------------
  )                      end of grouping
----------------------------------------------------------------------
)                        end of look-ahead
----------------------------------------------------------------------
[A-Z0-9._%+-]+           any character of: 'A' to 'Z', '0' to '9',
                         '.', '_', '%', '+', '-' (1 or more times
                         (matching the most amount possible))
----------------------------------------------------------------------
@                        '@'
----------------------------------------------------------------------
[A-Z0-9.-]{3,65}         any character of: 'A' to 'Z', '0' to '9',
                         '.', '-' (between 3 and 65 times (matching
                         the most amount possible))
----------------------------------------------------------------------
\.                       '.'
----------------------------------------------------------------------
[A-Z]{2,4}               any character of: 'A' to 'Z' (between 2
                         and 4 times (matching the most amount
                         possible))
----------------------------------------------------------------------
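
If the extraction results can be post-processed, the same goal (dropping images from the list of results) can also be reached without a lookahead. The sketch below is an alternative technique, not the method this answer describes: it runs the question's original pattern and then filters the matches by suffix. Python, the names CANDIDATE, IMAGE_SUFFIXES and emails_without_images, and the exact extension list are all hypothetical choices for illustration.

```python
import re

# The question's original pattern, with no lookahead. IGNORECASE is an
# assumption, needed because the classes only list upper-case letters.
CANDIDATE = re.compile(
    r"[A-Z0-9._%+-]+@[A-Z0-9.-]{3,65}\.[A-Z]{2,4}",
    re.IGNORECASE,
)

# Hypothetical list of image extensions; extend it as needed.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".bmp")


def emails_without_images(text):
    """Find candidate matches, then drop those ending in an image suffix."""
    return [
        match
        for match in CANDIDATE.findall(text)
        if not match.lower().endswith(IMAGE_SUFFIXES)
    ]


text = "droids@gmail.com _212000482_1@80xauto.jpg, More.Droids@deathstar.com"
print(emails_without_images(text))
# ['droids@gmail.com', 'More.Droids@deathstar.com']
```

Because the filter looks at the matched string rather than at what follows it, the trailing comma after the image name makes no difference here. The trade-off is that this only works where the matches can be filtered after extraction, which may not be possible if the software accepts nothing but a single regular expression.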