Neither HttpWebRequest nor HtmlAgilityPack can read from the table

Date: 2015-07-02 17:20:00

Tags: vb.net encoding character-encoding httpwebrequest html-agility-pack

This is the original function I wrote to fetch a web page's HTML so that I can parse it with the same code I use for "IE.document".

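The code works for some websites, but now I get an error at "doc.write". I think this is because the page is encoded as "iso-8859-1" while the second column of the table I am trying to parse uses a different encoding.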

Function mWebRe(ByVal mUrl As String) As MSHTML.HTMLDocument
    Dim request As HttpWebRequest = WebRequest.Create(mUrl)
    request.Timeout = 10000
    Dim doc As MSHTML.IHTMLDocument2 = New MSHTML.HTMLDocument
    Try
        Dim response As HttpWebResponse = request.GetResponse()
        'this is the original code
        'Dim reader As StreamReader = New StreamReader(response.GetResponseStream())

        'this is an attempt without effects
        Dim reader As StreamReader = New StreamReader(response.GetResponseStream(), Encoding.GetEncoding("iso-8859-1")) 
        Dim WebContent As String = reader.ReadToEnd() 'Here the text seems to be
        doc.clear()
        doc.write(WebContent) 'Here I get error on loading page 
        doc.close()

        ' The following is a must do, to make sure that the data is fully load.
        While (doc.readyState <> "complete")
            Thread.Sleep(50)
        End While

    Catch ex As Exception
        Return Nothing
    End Try
    Return doc
End Function

I tried modifying the code, and I also tried HtmlAgilityPack (which I had never used before), but without success.

I need the contents of the second "Table" (it has no id), so I wrote the code below (it fails to get the correct InnerText from the cells):

    Dim web As HtmlAgilityPack.HtmlWeb = New HtmlWeb()
    web.OverrideEncoding = Encoding.GetEncoding("ISO-8859-1")
    Dim doc As HtmlAgilityPack.HtmlDocument = web.Load(mUrl)

    ' A leading "//" searches the whole document, even when called on a child node;
    ' ".//" limits the search to the descendants of the current node.
    For Each Table As HtmlNode In doc.DocumentNode.SelectNodes("//table")
        For Each Row As HtmlNode In Table.SelectNodes(".//tr")
            For Each Cell As HtmlNode In Row.SelectNodes(".//td")
                Dim mTxt As String = Cell.InnerText
            Next
        Next
    Next

This is the "beginning" of the page source:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

This is an excerpt of the row I want to extract:

<tr>
<td class="tableValues" align="center" valign="top" >Mar 24/12/2013</td>
<td class="tableValues" align="left" valign="top" >&#73;sc&#114;it&#116;&#111; &#97;&#108; &#82;u&#111;&#108;<!--span-->&#111;<!--i>&#52;</i--></td>
<td class="tableValues" align="left" valign="top" ></td>
</tr>

I think the second column has a different encoding, but I don't know how to convert it into the correct text. Any suggestion is appreciated.

1 answer:

Answer 0 (score: 0)

I have just solved it with HtmlAgilityPack by inserting the code below into my code. If anyone can suggest a better solution, I would be grateful.

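            ' ".//td" restricts the search to the current row ("//td" would scan the whole document).
            For Each Cell As HtmlNode In Row.SelectNodes(".//td")
                Dim mTxt As String = Cell.InnerText
                If mTxt.Contains("&#") Then
                    ' Decode numeric character references such as &#73; into characters.
                    Dim StrOk As String = WebUtility.HtmlDecode(mTxt)
                    ' Remove the HTML comments embedded in the cell text.
                    StrOk = Regex.Replace(StrOk, "<!--.+?-->", String.Empty)
                    Debug.Print(StrOk)
                End If
            Next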
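
For what it is worth, the second column in the excerpt does not look like a different byte encoding. It appears to be ordinary text written as numeric character references (&#73; is "I", &#114; is "r", and so on) with HTML comments mixed in. Below is a minimal sketch of the same idea that runs without HtmlAgilityPack, using only the cell text from the question. The module name EntityDemo and the helper DecodeCell are my own hypothetical names, not part of the original code. It strips the comments before decoding, on the assumption that this is safer in case a decoded reference itself produces comment-like characters.

```vbnet
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

Module EntityDemo
    ' Hypothetical helper (not from the original post): removes HTML comments,
    ' then decodes character references such as &#73; into plain text.
    Function DecodeCell(ByVal rawText As String) As String
        ' Singleline lets "." match line breaks, so multi-line comments are removed too.
        Dim withoutComments As String =
            Regex.Replace(rawText, "<!--.*?-->", String.Empty, RegexOptions.Singleline)
        Return WebUtility.HtmlDecode(withoutComments).Trim()
    End Function

    Sub Main()
        ' Cell content copied from the row excerpt in the question.
        Dim cell As String =
            "&#73;sc&#114;it&#116;&#111; &#97;&#108; &#82;u&#111;&#108;<!--span-->&#111;<!--i>&#52;</i-->"
        Console.WriteLine(DecodeCell(cell)) ' prints: Iscritto al Ruolo
    End Sub
End Module
```

With HtmlAgilityPack, the same helper could presumably be called on Cell.InnerText for every cell, which would also make the Contains("&#") check unnecessary, since decoding text without references leaves it unchanged.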