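I have a lot of HTML files that I need to extract text from. If the text were all on one line I could do this easily, but when the tags wrap or span multiple lines I can't work out how to do it. Here is what I mean: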
<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>
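I don't care about the <br> tags unless they help with wrapping the text. The region I want always begins with "MySection" and ends with </section>. What I'd like to end up with is: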
Some text here another line here last line of text.
I'd prefer something like VBScript or a command-line option (sed?), but I don't know where to start. Any help?
Answer 0 (score: 4)
Normally, you would use the Internet Explorer COM object:
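root = "C:\base\dir"
Set fso = CreateObject("Scripting.FileSystemObject")
Set ie = CreateObject("InternetExplorer.Application")
For Each f In fso.GetFolder(root).Files
  ie.Navigate "file:///" & f.Path
  ' Wait until IE has finished loading the page.
  While ie.Busy : WScript.Sleep 100 : Wend
  text = ie.document.getElementById("MySection").innerText
  ' Join the lines with a space so that words do not run together.
  WScript.Echo Replace(text, vbNewLine, " ")
Next
ie.Quit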
However, the <section> tag isn't supported before IE 9, and even in IE 9 the COM object doesn't seem to handle it correctly, because getElementById("MySection") returns only the opening tag:
>>> wsh.echo ie.document.getelementbyid("MySection").outerhtml
<SECTION id=MySection>
You can use regular expressions instead:
root = "C:\base\dir"
Set fso = CreateObject("Scripting.FileSystemObject")

' Capture everything between the opening and closing tags, across lines.
Set re1 = New RegExp
re1.Pattern = "<section id=""MySection"">([\s\S]*?)</section>"
re1.Global = False
re1.IgnoreCase = True

' Collapse <br> tags and runs of whitespace into a single space.
Set re2 = New RegExp
re2.Pattern = "(<br>|\s)+"
re2.Global = True
re2.IgnoreCase = True

For Each f In fso.GetFolder(root).Files
  html = fso.OpenTextFile(f.Path).ReadAll
  Set m = re1.Execute(html)
  If m.Count > 0 Then
    ' The captured group is the first submatch of the first match.
    text = Trim(re2.Replace(m(0).SubMatches(0), " "))
    WScript.Echo text
  End If
Next
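For the command-line route the question asks about, the same extract-then-collapse idea can be sketched with sed and tr. This is only a rough sketch under assumptions that are not in the original: the function name extract_section is made up for illustration, the opening tag is written exactly as <section id="MySection">, the opening and closing tags each sit on their own line, and there is only one such section per file. Like the regular expressions above, it is not a real HTML parser.

```shell
# Hypothetical helper: print the text inside <section id="MySection"> on one line.
# Assumes the opening and closing tags are each on their own line.
extract_section() {
  # 1. Keep only the lines from the opening tag through the closing tag.
  # 2. Delete anything that looks like a tag (<section ...>, <br>, </section>).
  # 3. Turn every whitespace character into a space and squeeze repeats.
  # 4. Trim the single leading and trailing space that is left over.
  sed -n '/<section id="MySection">/,/<\/section>/p' "$1" |
    sed 's/<[^>]*>//g' |
    tr -s '[:space:]' ' ' |
    sed 's/^ //; s/ $//'
}
```

Calling extract_section index.html on the sample from the question should give the three lines joined by single spaces. To cover a whole folder, the call can be wrapped in a for loop over *.html.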
Answer 1 (score: 1)
Here is a one-liner using perl and the HTML parser from the Mojolicious framework:
perl -MMojo::DOM -E '
say Mojo::DOM->new( do { undef $/; <> } )->at( q|#MySection| )->text
' index.html
Assuming index.html contains the following:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body id="portada">
<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>
</body>
</html>
It produces:
Some text here another line here last line of text.