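I have an HTML document in .txt format that contains multiple tables and other text. I am trying to remove any HTML (anything inside "<>"), but only where it appears inside a table (between <table> and </table>).
For example, given the input below, the output should be the same text with the <b>, <u>, and <i> tags stripped from inside the table, leaving just "bold underlined italic text". Note that only the HTML inside the table is removed; <other HTML> outside the table stays as it is.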
===================
other text
<other HTML>
<table>
<b><u><i>bold underlined italic text</b></u></i>
</table>
other text
<other HTML>
==============
Any help is greatly appreciated!
Answer 0 (score: 4)
Imports System.Windows.Forms.HtmlDocument
Imports System.IO.File
Public Class Form1
    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        Dim myHTMLString As String
        Dim myDoc As HtmlDocument
        Dim myTables As HtmlElementCollection
        Dim myTable As HtmlElement
        Dim myAllTags As HtmlElementCollection
        Dim myHTMLTag As HtmlElement
        ' Read the source file and load it into the WebBrowser control on the form.
        myHTMLString = ReadAllText("C:\Users\Geoffrey Van Wyk\Desktop\myPage1.txt")
        WebBrowser1.DocumentText = myHTMLString
        ' Open a fresh document and write the HTML into it so it can be queried.
        myDoc = WebBrowser1.Document.OpenNew(True)
        myDoc.Write(myHTMLString)
        ' Take the first table only; loop over myTables to handle every table.
        myTables = myDoc.GetElementsByTagName("table")
        myTable = myTables.Item(0)
        ' Replace each direct child element of the table with its plain text.
        For Each child As HtmlElement In myTable.Children
            child.OuterText = child.InnerText
        Next
        ' Write the whole modified document back out.
        myAllTags = myDoc.GetElementsByTagName("html")
        myHTMLTag = myAllTags.Item(0)
        WriteAllText("C:\Users\Geoffrey Van Wyk\Desktop\myPage2.txt", myHTMLTag.OuterHtml)
    End Sub
End Class
I tested it. It works.
Answer 1 (score: 2)
input = Regex.Replace(input, @"<table>(.|\n)*?</table>", string.Empty, RegexOptions.Singleline);
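Here, input is the string containing the HTML. This regular expression removes everything from the opening <table> tag to the closing </table> tag, tags and text alike, including the table tags themselves. Give it a try!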
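Since the question asks to keep the text and drop only the tags inside each table, a variation on the regex idea may be closer to that goal. The following is a minimal sketch, not a tested production solution. The names TableTagStripper and StripTagsInsideTables are made up for illustration, and the code assumes that tables are not nested and that no attribute value contains a literal ">".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class TableTagStripper
{
    // One table element: opening tag (attributes allowed), body, nearest closing tag.
    // Singleline lets "." match newlines; the lazy ".*?" stops at the first </table>.
    private static readonly Regex TableBlock = new Regex(
        @"(<table\b[^>]*>)(.*?)(</table\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Any single tag, such as <b> or </i>.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    public static string StripTagsInsideTables(string input)
    {
        // Keep the <table> and </table> tags, strip every tag from the body between them.
        return TableBlock.Replace(input, m =>
            m.Groups[1].Value
            + AnyTag.Replace(m.Groups[2].Value, string.Empty)
            + m.Groups[3].Value);
    }

    public static void Main()
    {
        string input =
            "other text\n<other HTML>\n<table>\n" +
            "<b><u><i>bold underlined italic text</b></u></i>\n" +
            "</table>\nother text\n<other HTML>";
        Console.WriteLine(StripTagsInsideTables(input));
    }
}
```

Text outside the tables is never touched, because the tag-stripping pattern runs only on the captured table body. For nested tables or otherwise irregular markup, a real HTML parser is likely to be more reliable than regular expressions.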