Multiline regex in PowerShell

Date: 2015-06-22 21:30:20

Tags: regex powershell

I would be glad to get a solution to a problem with parsing an HTML file using regex:

d:\acc.html

<!-- WebSite-Watcher Demo Report -->



<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>WebSite-Watcher Report</title>
<style type="text/css">
<!--
a:link, a:active {
    color: #4040A0;
    text-decoration: underline;
}
a:visited {
    color: #8080A0;
    text-decoration: underline;
}
a:hover {
    background: #FFF000;
    color: #FF0000;
    text-decoration: underline;
}
body, td {
   font-size: 11px;
   line-height: 15px;
   font-family: Verdana, Arial;
}
li {
   list-style: square;
   font-size: 11px;
   line-height: 15px;
   margin-top: 10px;
}
-->
</style>
</head>

<body>
<center>

<table cellpadding="2" cellspacing="2" border="0" width="80%">
<tr>
<td>
<font color="#336699" style="font-size: 18px;"><b>Highlighted changes</b></font><br>
<div style="border-top: 1px dashed dadada; margin-top: 5px;"></div>
<br>

<font color="#f00000">This report displays the first 200 characters of highlighted changes,<br>
the length can be changed individually with the <b>wsw_url_highlighted_changes(200)</b> variable.</font><br>
<br>




<table cellpadding="5" cellspacing="0" border="0" width="100%">
<tr>
<td style="border-bottom-color: #d0d0d0; border-bottom-style: solid; border-bottom-width: 1px; background-color: #eaeaea;"><!-- F1E896 -->
<font style="font-size: 13px;"><b>Lorem ipsum</b></font><br><font color="#808080"> | <a href="http://www.hjccx.com/" target="_top">Web page</a> | <a href="file://x:/wswdb/wswdatabase_wsw/0004/2015052915594644815599.htm_chg.htm#wswchange1" target="_top">Local page</a></font>
</td>
</tr>
<tr>
<td style="border-bottom-color: #f0f0f0; border-bottom-style: solid; border-bottom-width: 1px; background-color: #f8f8f8;"><!-- F5F2C7 -->
<blockquote>
<br>
</blockquote>
</td>
</tr>
</table><br>
<br>


<table cellpadding="5" cellspacing="0" border="0" width="100%">
<tr>
<td style="border-bottom-color: #d0d0d0; border-bottom-style: solid; border-bottom-width: 1px; background-color: #eaeaea;"><!-- F1E896 -->
<font style="font-size: 13px;"><b>Lorem ipsum</b></font><br><font color="#808080">18-06-2015 | <a href="http://www.no target="_top">Web page</a> | <a href="file://x:/wswdb/wswdatabase_wsw/0004/2015052915594536915585.htm_chg.htm#wswchange1" target="_top">Local page</a></font>
</td>
</tr>
<tr>
<td style="border-bottom-color: #f0f0f0; border-bottom-style: solid; border-bottom-width: 1px; background-color: #f8f8f8;"><!-- F5F2C7 -->
<blockquote>
Lorem ipsum BBBBBBBBBBBB<br>
</blockquote>
</td>
</tr>
</table><br>
<br>

<table cellpadding="5" cellspacing="0" border="0" width="100%">
<tr>
<td style="border-bottom-color: #d0d0d0; border-bottom-style: solid; border-bottom-width: 1px; background-color: #eaeaea;"><!-- F1E896 -->
<font style="font-size: 13px;"><b>Lorem ipsum</b></font><br><font color="#808080">18-06-2015 | <a href="http://www.no target="_top">Web page</a> | <a href="file://x:/wswdb/wswdatabase_wsw/0004/2015052915594536915585.htm_chg.htm#wswchange1" target="_top">Local page</a></font>
</td>
</tr>
<tr>
<td style="border-bottom-color: #f0f0f0; border-bottom-style: solid; border-bottom-width: 1px; background-color: #f8f8f8;"><!-- F5F2C7 -->
<blockquote>
Lorem ipsum BBBBBBBBBBBB<br>AAAAAAAAAAAAAAAaa AA<br>
</blockquote>
</td>
</tr>
</table><br>
<br>


<br>
<br>

<div style="border-top: 1px dashed dadada;"></div>
<font color="#808080"><i>Report date: 18-06-2015</i></font><br>
</td>
</tr>
</table><br>
</center>
</body>
</html>

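I need to "clean" this file by removing the first, empty entry (one with no text, just some whitespace or, usually, just a <br>).

I know there is a multiline regex solution for this in PowerShell. It might look like this:

d:\pattern.txt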

(?=<table cellpadding="5" ).*(?=<blockquote>).{0,6}(?=<\/blockquote>)

Script (thanks, Jisaak):

$content = (Get-Content 'd:\acc.html' -raw)
$pattern = (Get-Content 'd:\pattern.txt' -raw)

[regex]::Replace($content, $pattern, '',`
     [System.Text.RegularExpressions.RegexOptions]::Multiline `
     -bor [System.Text.RegularExpressions.RegexOptions]::Singleline)

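By .{0,6} I mean 0 to 6 of any character.

This regex does not work properly, and I am having trouble writing such an advanced pattern. Thank you for your help.

2 answers:

Answer 0 (score: 1)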

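Wouldn't this problem be easier if you didn't have to deal with multiple lines?

My experience with regex is limited, and with HTML nonexistent, but the following workaround turns your block into a single line (and back into a block again):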

$file  = (Get-Content ".\acc.html" -raw)

# Replace new line CR LF with a string (e.g. NEWLINE)
$file2 = ([regex]::Replace($file, ">`r`n", ">NEWLINE", "Singleline"))
$file2 | out-file ".\acc_edited.html"

# Single line regex replacement here to get rid of empty table.
# String NEWLINE can be used to indicate a new line.

# Replace the string with new line characters CR LF after regex replacement.
[regex]::Replace($file2, ">NEWLINE", ">`r`n", "Singleline") | Out-File ".\acc_original.html"
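
The placeholder step above could be filled in roughly like this. It is only a sketch under assumptions: $flat is a hypothetical, shortened stand-in for the flattened $file2, and the pattern assumes that an "empty" entry is a blockquote holding nothing but NEWLINE markers, <br> tags, and whitespace.

```powershell
# Hypothetical flattened sample: two result tables, only the first has an empty <blockquote>.
$flat = @(
    '<table cellpadding="5" cellspacing="0" border="0" width="100%">NEWLINE'
    '<tr>NEWLINE<td><blockquote>NEWLINE<br>NEWLINE</blockquote>NEWLINE</td>NEWLINE</tr>NEWLINE'
    '</table><br>NEWLINE'
    '<table cellpadding="5" cellspacing="0" border="0" width="100%">NEWLINE'
    '<tr>NEWLINE<td><blockquote>NEWLINE'
    'Lorem ipsum BBBBBBBBBBBB<br>NEWLINE</blockquote>NEWLINE</td>NEWLINE</tr>NEWLINE'
    '</table><br>NEWLINE'
) -join ''

# Assumed pattern (not from the answer): a cellpadding="5" table whose blockquote
# contains only NEWLINE markers, <br> tags, or whitespace.
# (?:(?!</table>).)*? is lazy and stops at the first </table>, so a match stays inside one table.
$emptyTable = '<table cellpadding="5"(?:(?!</table>).)*?<blockquote>(?:NEWLINE|<br>|\s)*</blockquote>(?:(?!</table>).)*?</table>(?:<br>)?(?:NEWLINE)?'

# Remove every table that matches; tables with text in the blockquote are left alone.
$cleaned = [regex]::Replace($flat, $emptyTable, '')
$cleaned
```

In the script above, the same Replace call would sit between the two NEWLINE conversions, with $file2 in place of $flat. No Singleline option is needed here, because the flattened block no longer contains line breaks for . to cross.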

Answer 1 (score: -1)

This should work:

(?<=<table cellpadding="5" cellspacing="0" border="0" width="100%">).*
(?=<blockquote>)|(?<=<blockquote>).{0,6}(?=<\/blockquote>)
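
Something to check before relying on this: with the Singleline option, a greedy .* matches across line breaks and runs as far as it can, so the first alternative may reach from the first table to the last <blockquote> in the file. The Multiline option only changes ^ and $, which the pattern does not use. The sketch below uses a hypothetical, shortened sample ($content and the pattern in $pattern are assumptions, not part of the answer) to show that effect and one possible way to remove the whole empty table instead.

```powershell
# Hypothetical sample with CRLF line breaks: two tables, only the first <blockquote> is empty.
$content = @(
    '<table cellpadding="5" cellspacing="0" border="0" width="100%">'
    '<tr>'
    '<td><!-- F5F2C7 -->'
    '<blockquote>'
    '<br>'
    '</blockquote>'
    '</td>'
    '</tr>'
    '</table><br>'
    '<br>'
    '<table cellpadding="5" cellspacing="0" border="0" width="100%">'
    '<tr>'
    '<td><!-- F5F2C7 -->'
    '<blockquote>'
    'Lorem ipsum BBBBBBBBBBBB<br>'
    '</blockquote>'
    '</td>'
    '</tr>'
    '</table><br>'
) -join "`r`n"

# First alternative of the pattern above, with (?s) standing in for RegexOptions.Singleline.
# The greedy .* backtracks only as far as the LAST <blockquote>, so it spans both tables.
$greedy = [regex]::Match($content, '(?s)(?<=<table cellpadding="5" cellspacing="0" border="0" width="100%">).*(?=<blockquote>)')
$spansTables = $greedy.Value -match '</table>'

# Assumed alternative: match one whole table whose blockquote holds only whitespace or <br>.
# (?:(?!</table>).)*? is lazy and cannot run past the end of the current table.
$pattern = '(?s)<table cellpadding="5"(?:(?!</table>).)*?<blockquote>(?:\s|<br>)*</blockquote>(?:(?!</table>).)*?</table>(?:<br>\s*)*'

$cleaned = [regex]::Replace($content, $pattern, '')
$cleaned
```

Here $spansTables records whether the greedy match crossed a table boundary, and $cleaned holds the text with the empty table removed. On the real report, the output could be written back with Out-File once the result looks right.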