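This is the original function I wrote to fetch a web page's HTML and parse it with the same code I use for "IE.document".
The code works for some websites, but now I get an error on "doc.write". I think this is because the page declares "iso-8859-1" encoding while the second column of the table I am trying to parse uses a different encoding.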
Function mWebRe(ByVal mUrl As String) As MSHTML.HTMLDocument
    Dim request As HttpWebRequest = WebRequest.Create(mUrl)
    request.Timeout = 10000
    Dim doc As MSHTML.IHTMLDocument2 = New MSHTML.HTMLDocument
    Try
        Dim response As HttpWebResponse = request.GetResponse()
        'this is the original code
        'Dim reader As StreamReader = New StreamReader(response.GetResponseStream())
        'this is an attempt without effects
        Dim reader As StreamReader = New StreamReader(response.GetResponseStream(), Encoding.GetEncoding("iso-8859-1"))
        Dim WebContent As String = reader.ReadToEnd() 'Here the text seems to be
        doc.clear()
        doc.write(WebContent) 'Here I get error on loading page
        doc.close()
        ' The following is a must do, to make sure that the data is fully load.
        While (doc.readyState <> "complete")
            Thread.Sleep(50)
        End While
    Catch ex As Exception
        Return Nothing
    End Try
    Return doc
End Function
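I tried modifying the code, and I also tried HtmlAgilityPack (which I had never used before), without success.
I need the contents of the second "Table" (it has no id), so I wrote the code below (it fails to get the correct InnerText from the cells):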
Dim web As HtmlAgilityPack.HtmlWeb = New HtmlWeb()
web.OverrideEncoding = Encoding.GetEncoding("ISO-8859-1")
Dim doc As HtmlAgilityPack.HtmlDocument = web.Load(mUrl)
For Each Table As HtmlNode In doc.DocumentNode.SelectNodes("//table")
    For Each Row As HtmlNode In Table.SelectNodes("//tr")
        For Each Cell As HtmlNode In Row.SelectNodes("//td")
            Dim mTxt As String = Cell.InnerText
        Next
    Next
Next
This is the "beginning" of the page source:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
This is an excerpt of the rows I want to extract:
<tr>
<td class="tableValues" align="center" valign="top" >Mar 24/12/2013</td>
<td class="tableValues" align="left" valign="top" >Iscritto al Ruol<!--span-->o<!--i>4</i--></td>
<td class="tableValues" align="left" valign="top" ></td>
</tr>
I think the second column has a different encoding, but I don't know how to convert it to the correct text. Any suggestion is appreciated.
Answer 0 (score: 0)
I just solved it with HtmlAgilityPack by inserting the code below into my code. If anyone can suggest a better solution, I would be grateful.
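For Each Cell As HtmlNode In Row.SelectNodes("//td")
    Dim mTxt As String = Cell.InnerText
    ' Only cells containing numeric character references ("&#...;") need decoding
    If mTxt.Contains("&#") Then
        ' Decode the character references into plain characters
        Dim StrOk As String = WebUtility.HtmlDecode(mTxt)
        ' Strip any HTML comments remaining in the text
        StrOk = Regex.Replace(StrOk, "<!--.+?-->", String.Empty)
        Debug.Print(StrOk)
    End If
Next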
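A possible variation on the snippet above, offered only as a rough sketch: it selects the second table directly, removes comment nodes before reading the text, and decodes character references with HtmlAgilityPack's own HtmlEntity.DeEntitize instead of WebUtility.HtmlDecode plus a regex. The module name SecondTableSketch and the function name ReadSecondTableCells are hypothetical, and it assumes the wanted table really is the second table element in document order (nested tables count too). It also uses XPath that starts with "." so each query stays relative to the table; a bare "//td" searches the whole document again, whichever node it is called on.

```vbnet
Imports System.Collections.Generic
Imports System.Text
Imports HtmlAgilityPack

Module SecondTableSketch
    ' Hypothetical helper: returns the cleaned text of each cell in the second table.
    Function ReadSecondTableCells(ByVal mUrl As String) As List(Of String)
        Dim result As New List(Of String)

        Dim web As New HtmlWeb()
        web.OverrideEncoding = Encoding.GetEncoding("ISO-8859-1")
        Dim doc As HtmlAgilityPack.HtmlDocument = web.Load(mUrl)

        ' (//table)[2] selects the second table in document order (assumption:
        ' that is the table the question is after).
        Dim table As HtmlNode = doc.DocumentNode.SelectSingleNode("(//table)[2]")
        If table Is Nothing Then Return result

        ' Drop comment nodes such as <!--span--> so they cannot leak into InnerText.
        Dim comments As HtmlNodeCollection = table.SelectNodes(".//comment()")
        If comments IsNot Nothing Then
            For Each comment As HtmlNode In comments
                comment.Remove()
            Next
        End If

        ' The leading "." keeps the query relative to the table.
        Dim cells As HtmlNodeCollection = table.SelectNodes(".//td")
        If cells Is Nothing Then Return result

        For Each cell As HtmlNode In cells
            ' DeEntitize turns character references such as &#111; into plain characters.
            result.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim())
        Next

        Return result
    End Function
End Module
```

Calling ReadSecondTableCells(mUrl) would then give one string per cell, which can be grouped into rows of three if the table layout matches the excerpt in the question. SelectNodes returns Nothing when no node matches, hence the explicit checks before each loop.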