Question

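I wrote a script that grabs different fields from an HTML file and populates variables with the results. I am using a regular expression to grab the email address. Here is some sample code: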

$txt='<p class=FillText><a name="InternetMail_P3"></a>First.Last@company-name.com</p>'

$re='.*?'+'([\\w-+]+(?:\\.[\\w-+]+)*@(?:[\\w-]+\\.)+[a-zA-Z]{2,7})'

if ($txt -match $re)
{
    $email1=$matches[1]
    write-host "$email1"
}

I get the following error:

Bad argument to operator '-match': parsing ".*?([\\w-+]+(?:\\.[\\w-+]+)*@(?:[\\w-]+\\
.)+[a-zA-Z]{2,7})([\\w-+]+(?:\\.[\\w-+]+)*@(?:[\\w-]+\\.)+[a-zA-Z]{2,7})" - [x-y] range in reverse order..
At line:7 char:16
+ if ($txt -match <<<<  $re)
    + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : BadOperatorArgument

What am I missing here? Also, is there a better regular expression for email addresses?

Thanks in advance.

Answer 1

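Actually, any regular expression that works in .NET or C# also works in PowerShell. You can find tons of samples on Stack Overflow and elsewhere on the internet. For example: How to Find or Validate an Email Address: The Official Standard: RFC 2822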

$txt='<p class=FillText><a name="InternetMail_P3"></a>First.Last@company-name.com</p>'
$re="[a-z0-9!#\$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#\$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
[regex]::Match($txt, $re, "IgnoreCase")
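
As for the error in the question: PowerShell does not treat the backslash as a string escape character, so `\\w` reaches the regex engine as an escaped literal backslash followed by a plain `w`. Inside the character class, `w-+` is then read as a range from `w` down to `+`, which is what "[x-y] range in reverse order" is complaining about. A minimal sketch of a corrected pattern follows, assuming the intent was `\w` plus the `+` and `-` characters. The leading `.*?` is dropped because `-match` already searches anywhere in the string.

```powershell
# Sample input taken from the question.
$txt = '<p class=FillText><a name="InternetMail_P3"></a>First.Last@company-name.com</p>'

# Single backslashes: PowerShell passes them to the regex engine unchanged.
# The hyphen sits at the end of each class so it is read as a literal character.
$re = '([\w+-]+(?:\.[\w+-]+)*@(?:[\w-]+\.)+[a-zA-Z]{2,7})'

$email1 = $null
if ($txt -match $re) {
    # $Matches[1] holds the first capture group.
    $email1 = $Matches[1]
    Write-Host $email1
}
```

This pattern is intentionally loose and is not an RFC-grade validator.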

But there is another part to this answer. Regular expressions are inherently a poor fit for parsing XML/HTML. You can find more details here: Using regular expressions to parse HTML: why not?

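To offer a real solution, I would first recommend:

Convert the HTML to XHTML
Traverse the XML tree
Work with the individual nodes one by one, using regular expressions on them if needed.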
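
The steps above could be sketched roughly as follows. This assumes the HTML has already been converted to well-formed XHTML by some other tool (the conversion itself is not shown), and the XPath expression and the simple pattern are illustrative choices, not part of the original answer.

```powershell
# Assumption: the markup is already well-formed XHTML
# (note the quoted class attribute, unlike the snippet in the question).
$xhtml = '<p class="FillText"><a name="InternetMail_P3"></a>First.Last@company-name.com</p>'

# Load the markup as an XML document.
$doc = [xml]$xhtml

# Traverse the tree: select the paragraph by its class attribute.
$node = $doc.SelectSingleNode('//p[@class="FillText"]')

# Apply a deliberately simple pattern to the text of that one node only.
$re = '[\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,7}'
$email = $null
if ($node -and $node.InnerText -match $re) {
    # $Matches[0] holds the whole match.
    $email = $Matches[0]
    Write-Host $email
}
```

Because the regex only sees the text of a single node, it no longer has to cope with tags or attributes.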

Answer 2

When it comes to email validation, I usually go with the short version of RFC 2822:

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

You can find more information about email validation here.
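
A minimal sketch of using this kind of pattern from PowerShell to collect every address in a piece of text. The sample text and variable names are hypothetical. The pattern is written in a single-quoted string so that `$` and the backtick stay literal, and each apostrophe is doubled, which is how a single quote is written inside a single-quoted PowerShell string.

```powershell
# Hypothetical sample text containing two addresses.
$text = 'Contact: First.Last@company-name.com, other@example.org.'

# Short RFC 2822 style pattern; '' stands for one literal apostrophe.
$re = '[a-z0-9!#$%&''*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&''*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?'

# The pattern only lists lowercase letters, so IgnoreCase is needed for mixed-case input.
$found = @([regex]::Matches($text, $re, 'IgnoreCase') | ForEach-Object { $_.Value })

$found | ForEach-Object { Write-Host $_ }
```

`[regex]::Matches` returns all matches, whereas the `-match` operator stops at the first one.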

Using Regex in PowerShell to grab emails