Parsing custom tags in an HTML file with PowerShell

Time: 2017-07-13 20:50:39

Tags: html powershell powershell-v3.0

I have an HTML file, shown below, that contains a custom tag, "TBD:comment". I want to extract the content of this tag.

<HTML>
<BODY>
<h1> This is a heading </h1>
<P id='para1'>First Paragraph with some Random text</P>
<P>Second paragraph with more random text</P>
<A href="http://Geekeefy.wordpress.com">Cool Powershell blog</A>
<TBD:comment name="Title"><h3>Katamma katamma loge kathamma</h3> 
</TBD:comment>
<TBD:comment name="content"><h3>Lorem Ipsum is simply dummy text of the 
printing and typesetting industry. Lorem Ipsum has been the industry's 
standard dummy text ever since the 1500s, when an unk</h3> </TBD:comment>
</BODY>
</HTML>

The following code does not seem to work for the custom tag.

$html = Get-Content "C:\Users\sahuBaba\Desktop\ht.html" -Raw
$doc = New-Object -com "HTMLFILE"
$doc.IHTMLDocument2_write($html)

$text = $doc.body.getElementsByTagName("TBD:comment")
"Inner Text: " + $text[1].innerText

It produces no output. Can anyone help? Thanks in advance.

1 Answer:

Answer 0: (score: 1)

Try using a regular expression:

$regex = New-Object Text.RegularExpressions.Regex "<TBD:comment.+?(>.+?)<\/TBD:comment>", ('singleline', 'multiline')
$content = "<your html>"
foreach($m in $regex.Matches($content)) {
    # group 1 starts at the '>' that closes the opening tag; drop that character
    $m.Groups[1].Value.Substring(1)
}
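
As a variation on the regex above, here is a minimal, self-contained sketch that also captures the name attribute, so each comment can be looked up by name instead of by position. The sample input is a shortened copy of the HTML from the question. The pattern and the variable names ($pattern, $comments) are illustrative assumptions, not part of the original answer. The sketch also assumes that name is the first attribute, that it is double-quoted, and that TBD:comment tags are never nested. Tag stripping with -replace is only a rough stand-in for innerText.

```powershell
# Sample input, shortened from the question. In practice it would come from
# Get-Content <path> -Raw.
$content = @'
<HTML>
<BODY>
<TBD:comment name="Title"><h3>Katamma katamma loge kathamma</h3>
</TBD:comment>
<TBD:comment name="content"><h3>Lorem Ipsum is simply dummy text of the
printing and typesetting industry.</h3> </TBD:comment>
</BODY>
</HTML>
'@

# (?s) lets '.' match newlines, (?i) ignores case.
# Assumption: name="..." is the first attribute of the opening tag.
$pattern = '(?si)<TBD:comment\s+name="(?<name>[^"]*)"[^>]*>(?<body>.*?)</TBD:comment>'

$comments = foreach ($m in [regex]::Matches($content, $pattern)) {
    $body = $m.Groups['body'].Value
    [pscustomobject]@{
        Name = $m.Groups['name'].Value
        Html = $body.Trim()
        # Crude tag stripping; assumes no '>' inside attribute values.
        Text = ($body -replace '<[^>]+>', '').Trim()
    }
}

$comments | Format-List
```

A single entry can then be selected with, for example, $comments | Where-Object { $_.Name -eq 'content' }. Regex-based extraction like this is fragile on arbitrary HTML, so it is best kept to files whose layout you control.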