Using VB

Posted: 2016-03-22 19:56:30

Tags: vb.net web-scraping

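I am new to VB.NET and am currently learning how to scrape and parse websites. My problem in short: if I use "getElementsByClassName" more than once in my code, it only works the first time. The same happens with "getElementsByTagName". Even when I parse the HTML manually, it only works the first time.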

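Below is an example using "getElementsByClassName". I have Form1 with Button1 and ListBox1. I am trying to get news titles from two websites (Google and BBC) and put them into ListBox1. As you can see, I split my code into two parts. Both parts work well and get the information I need, but only when used separately. When put together as in the example below, the first part (Google) executes without a problem, but the second part (BBC) gives me an error on the line "Dim AllItemsBBC As Object = SecondBrowser.Document.getElementsByClassName("title-link__title-text")".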

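What is more interesting, if I flip the code and put the BBC part first and the Google part second, BBC executes without a problem and Google gives me an error on the line "Dim AllItemsGoogle As Object = FirstBrowser.Document.getElementsByClassName("titletext")". Basically, whichever part executes first has no problem, and the second one fails.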

The error message says: "An unhandled exception of type 'System.NotSupportedException' occurred in Microsoft.VisualBasic.dll. Additional information: Exception from HRESULT: 0x800A01B6".
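
For readers unfamiliar with HRESULT values, the number in the message can be split into its fields. The sketch below is an illustration added for context and is not part of the program in question; the module name HResultDemo is hypothetical. It shows that 0x800A01B6 carries facility 10 (FACILITY_CONTROL) and code 438, which is the number of the classic Visual Basic run-time error "Object doesn't support this property or method". It does not by itself explain why only the second late-bound call fails.

```vbnet
Imports System

Module HResultDemo
    Sub Main()
        ' The HRESULT from the error message; this hex literal is a negative Integer.
        Dim hr As Integer = &H800A01B6

        ' Low 16 bits: the error code.
        Dim code As Integer = hr And &HFFFF
        ' Bits 16 to 26: the facility that raised the error.
        Dim facility As Integer = (hr >> 16) And &H7FF

        Console.WriteLine("Facility: " & facility)
        Console.WriteLine("Code: " & code)
    End Sub
End Module
```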

Example 1:

Public Class Form1

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click

        'START  OF PART 1
        'Creating and navigating the IE browser to Google news page
        Dim FirstBrowser As Object = CreateObject("InternetExplorer.Application")
        FirstBrowser.Visible = True
        FirstBrowser.Navigate("https://news.google.com/news?cf=all&pz=1&ned=us")
        Do
            Application.DoEvents()
        Loop Until FirstBrowser.readyState = 4

        'Getting the titles from Google news page and adding them to ListBox1
        Dim ItemGoogle As Object
        Dim AllItemsGoogle As Object = FirstBrowser.Document.getElementsByClassName("titletext")
        For Each ItemGoogle In AllItemsGoogle
            ListBox1.Items.Add(ItemGoogle.InnerText)
        Next ItemGoogle

        'Closing the browser
        FirstBrowser.Quit()
        'END OF PART1

        'START  OF PART 2
        'Creating and navigating the IE browser to BBC news page
        Dim SecondBrowser As Object = CreateObject("InternetExplorer.Application")
        SecondBrowser.Visible = True
        SecondBrowser.Navigate("http://www.bbc.com/news")
        Do
            Application.DoEvents()
        Loop Until SecondBrowser.readyState = 4

        'Getting the titles from BBC news page and adding them to ListBox1
        Dim ItemBBC As Object
        Dim AllItemsBBC As Object = SecondBrowser.Document.getElementsByClassName("title-link__title-text")
        For Each ItemBBC In AllItemsBBC
            ListBox1.Items.Add(ItemBBC.InnerText)
        Next ItemBBC

        'Closing the browser
        SecondBrowser.Quit()
        'END OF PART 2

    End Sub
End Class

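My second example parses the same websites by simply searching for the phrases I need. Same situation: the Google part works, and BBC fails on the line "Dim the_html_code_bbc As String = SecondBrowser.Document.Body.InnerHTML".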

Flip it around, and BBC works fine while Google fails on the line "Dim the_html_code_google As String = FirstBrowser.Document.Body.InnerHTML".

The error message says: "An unhandled exception of type 'System.MissingMemberException' occurred in Microsoft.VisualBasic.dll. Additional information: Public member 'InnerHTML' on type 'JScriptTypeInfo' not found."

Example 2:

Public Class Form1

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click

        'START  OF PART 1
        'Creating and navigating the IE browser to Google news page
        Dim FirstBrowser As Object = CreateObject("InternetExplorer.Application")
        FirstBrowser.Visible = True
        FirstBrowser.Navigate("https://news.google.com/news?cf=all&pz=1&ned=us")
        Do
            Application.DoEvents()
        Loop Until FirstBrowser.readyState = 4

        'Getting the titles from Google news page and adding them to ListBox1
        Dim the_html_code_google As String = FirstBrowser.Document.Body.InnerHTML
        Dim start_of_code_google As Integer 'InStr returns an Integer position
        Dim code_selection_google As String
        Do
            Application.DoEvents()
            start_of_code_google = InStr(the_html_code_google, "titletext")
            If start_of_code_google > 0 Then
                code_selection_google = Mid(the_html_code_google, start_of_code_google + 11, Len(the_html_code_google))
                the_html_code_google = Mid(the_html_code_google, start_of_code_google + 11, Len(the_html_code_google))
                code_selection_google = Mid(code_selection_google, 1, InStr(code_selection_google, Chr(60)) - 1)
                ListBox1.Items.Add(code_selection_google)
            End If
        Loop Until start_of_code_google = 0

        'Closing the browser
        FirstBrowser.Quit()
        'END OF PART1


        'START  OF PART 2
        'Creating and navigating the IE browser to BBC news page
        Dim SecondBrowser As Object = CreateObject("InternetExplorer.Application")
        SecondBrowser.Visible = True
        SecondBrowser.Navigate("http://www.bbc.com/news")
        Do
            Application.DoEvents()
        Loop Until SecondBrowser.readyState = 4

        'Getting the titles from BBC news page and adding them to ListBox1
        Dim the_html_code_bbc As String = SecondBrowser.Document.Body.InnerHTML
        Dim start_of_code_bbc As Integer 'InStr returns an Integer position
        Dim code_selection_bbc As String
        Do
            Application.DoEvents()
            start_of_code_bbc = InStr(the_html_code_bbc, "title-link__title-text")
            If start_of_code_bbc > 0 Then
                code_selection_bbc = Mid(the_html_code_bbc, start_of_code_bbc + 24, Len(the_html_code_bbc))
                the_html_code_bbc = Mid(the_html_code_bbc, start_of_code_bbc + 24, Len(the_html_code_bbc))
                code_selection_bbc = Mid(code_selection_bbc, 1, InStr(code_selection_bbc, Chr(60)) - 1)
                ListBox1.Items.Add(code_selection_bbc)
            End If
        Loop Until start_of_code_bbc = 0

        'Closing the browser
        SecondBrowser.Quit()
        'END OF PART 2

    End Sub
End Class
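
The string scanning in Example 2 does not depend on the browser object, so it can be separated out and tried against a fixed string. The following is a minimal sketch under that assumption: TitleExtractor, ExtractTitles, and the sample HTML are hypothetical names and data introduced for illustration, and a skip of 2 corresponds to the two characters (the closing quote and ">") that the +11 and +24 offsets step over. It does not address the exception itself.

```vbnet
Imports System
Imports System.Collections.Generic

Module TitleExtractor
    ' Hypothetical helper: returns the text that follows each occurrence of
    ' marker (plus "skip" extra characters), up to the next "<".
    Function ExtractTitles(html As String, marker As String, skip As Integer) As List(Of String)
        Dim titles As New List(Of String)
        Dim position As Integer = html.IndexOf(marker, StringComparison.Ordinal)
        While position >= 0
            ' First character of the title text.
            Dim textStart As Integer = position + marker.Length + skip
            If textStart >= html.Length Then Exit While
            ' The title ends where the next tag begins.
            Dim textEnd As Integer = html.IndexOf("<"c, textStart)
            If textEnd < 0 Then Exit While
            titles.Add(html.Substring(textStart, textEnd - textStart))
            ' Continue searching after the title just extracted.
            position = html.IndexOf(marker, textEnd, StringComparison.Ordinal)
        End While
        Return titles
    End Function

    Sub Main()
        ' Made-up sample markup, not taken from either website.
        Dim sample As String = "<span class=""titletext"">First</span><span class=""titletext"">Second</span>"
        For Each title As String In ExtractTitles(sample, "titletext", 2)
            Console.WriteLine(title)
        Next
    End Sub
End Module
```

With a helper like this, each part of Button1_Click would only need to read the page's HTML once and pass it in, for example ExtractTitles(the_html_code_bbc, "title-link__title-text", 2).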

Another thing worth mentioning: if I use one method to parse the Google part and a different method for the BBC part, everything works fine.

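I must be missing something due to my lack of experience with Visual Studio. I am using the Express 2013 for Windows Desktop edition. If you know what is causing this problem, I would really appreciate your advice.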

0 answers:

No answers