Scraping the website URL from the first search result

Time: 2020-03-12 11:18:43

Tags: html excel vba internet-explorer firefox

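I am currently working on an Excel project, using VBA code to automate what would otherwise be a nearly impossible manual task. I have a spreadsheet containing about 25,000 company keywords, and I want to obtain each company's website URL from them. I am therefore looking to run a VBA script that runs each keyword as a Google search and pulls the URL of the first result into the spreadsheet. I should point out that I am a beginner with this kind of code, so I may be overlooking something that seems simple to others.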

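I tested the code from a similar thread, but found the results to be error-prone. Some keywords return the URL in the next column, while others are left blank. It also seems to pull the URL of a Google-optimised sub-link within the first search result rather than the main website URL (see the reference below):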

Google Search Result example

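I then found the code below (here), which I ran on a sample list of 1,000 keywords. The blog's author states that this code is for Mozilla Firefox. I also tested the IE code he wrote, but it did not give the same result (it added hyperlinks made up of the descriptive text from the search results rather than the raw URL). The Firefox code (inserted below for reference) worked correctly up to row 714, where the macro returned the error message "Run-time error 91: Object variable or With block variable not set".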

Spreadsheet layout showing successful results and row at which macro stopped

I would appreciate any help on why the macro stops at this point and how to adjust the code to avoid such errors in the future.

Sub GoogleURL()

    Dim url As String, lastRow As Long
    Dim XMLHTTP As Object
    Dim html As Object
    Dim objResultDiv As Object
    Dim objH As Object

    lastRow = Range("A" & Rows.Count).End(xlUp).Row

    For i = 2 To lastRow

        url = "https://www.google.co.uk/search?q=" & Cells(i, 1) & "&rnd=" & WorksheetFunction.RandBetween(1, 10000)

        Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
        XMLHTTP.Open "GET", url, False
        XMLHTTP.setRequestHeader "Content-Type", "text/xml"
        XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
        XMLHTTP.send

        Set html = CreateObject("htmlfile")
        html.body.innerHTML = XMLHTTP.ResponseText
        Set objResultDiv = html.getelementbyid("rso")
        Set objH = objResultDiv.getelementsbytagname("h3")(0)
        Cells(i, 2).Value = objH.innerText

        Set html = CreateObject("htmlfile")
        html.body.innerHTML = XMLHTTP.ResponseText
        Set objResultDiv = html.getelementbyid("rso")
        Set objH = objResultDiv.getelementsbytagname("cite")(0)
        Cells(i, 3).Value = objH.innerText

        DoEvents

    Next

End Sub

1 Answer:

Answer 0: (score: 1)

Since Firefox is a third-party browser outside Microsoft's support scope, I can help you check the VBA code for the IE browser.

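You said that the VBA code for the IE browser given in this link produces the description with the link attached, whereas your requirement is to store the description and the link in separate columns.

I modified that sample code to meet your requirement. Here is the modified code from that sample: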

Option Explicit
Const TargetItemsQty = 1 ' results for each keyword

Sub GWebSearchIECtl()

    Dim objSheet As Worksheet
    Dim objIE As Object
    Dim x As Long
    Dim y As Long
    Dim strSearch As String
    Dim lngFound As Long
    Dim st As String
    Dim colGItems As Object
    Dim varGItem As Variant
    Dim strHLink As String
    Dim strDescr As String
    Dim strNextURL As String

    Set objSheet = Sheets("Sheet1")
    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = True ' for debug or captcha request cases
    y = 1 ' start searching for the keyword in the first row
    With objSheet
        .Select
        .Range(.Columns("B:B"), .Columns("B:B").End(xlToRight)).Delete ' clear previous results
        .Range(.Columns("C:C"), .Columns("C:C").End(xlToRight)).Delete ' clear previous results
        .Range("A1").Select
        Do Until .Cells(y, 1) = ""
            x = 2 ' start writing results from column B
            .Cells(y, 1).Select
            strSearch = .Cells(y, 1) ' current keyword
            With objIE
                lngFound = 0
                .navigate "https://www.google.com/search?q=" & EncodeUriComponent(strSearch) ' go to first search results page
                Do
                    Do While .Busy Or Not .READYSTATE = 4: DoEvents: Loop ' wait IE
                    Do Until .document.READYSTATE = "complete": DoEvents: Loop ' wait document
                    Do While TypeName(.document.getelementbyid("res")) = "Null": DoEvents: Loop ' wait [#res] element
                    Set colGItems = .document.getelementbyid("res").getElementsByClassName("g") ' collection of search result [.g] items
                    For Each varGItem In colGItems ' process each item in collection
                        If varGItem.getelementsbytagname("a").Length > 0 And varGItem.getElementsByClassName("st").Length > 0 Then ' must have hyperlink and description
                            strHLink = varGItem.getelementsbytagname("a")(0).href ' get first hyperlink [a] found in current item
                            strDescr = GetInnerText(varGItem.getElementsByClassName("st")(0).innerHTML) ' get first description [span.st] found in current item
                            lngFound = lngFound + 1
                            'Debug.Print (strHLink)
                            'Debug.Print (strDescr)
                            With objSheet ' put result into cells
                                .Cells(y, x).Value = strDescr ' description
                                .Hyperlinks.Add .Cells(y, x + 1), strHLink ' link in the next column
                                .Cells(y, x).WrapText = True
                                x = x + 2 ' move past the description/link pair
                            End With
                            If lngFound = TargetItemsQty Then Exit Do ' continue with next keyword - necessary quantity of the results for current keyword found
                        End If
                        DoEvents
                    Next
                    If TypeName(.document.getelementbyid("pnnext")) = "Null" Then Exit Do ' continue with next keyword - no [a#pnnext.pn] next page button exists
                    strNextURL = .document.getelementbyid("pnnext").href ' get next page url
                    .navigate strNextURL ' go to next search results page
                Loop
            End With
            y = y + 1 ' next row
        Loop
    End With
    objIE.Quit

    ' google web search page contains the elements:
    ' [div#res] - main search results block
    ' [div.g] - each result item block within [div#res]
    ' [a] - hyperlink anchor(s) within each [div.g]
    ' [span.st] - description(s) within each [div.g]
    ' [a#pnnext.pn] - hyperlink anchor to the next search results page

End Sub

Function EncodeUriComponent(strText As String) As String
    Static objHtmlfile As Object

    If objHtmlfile Is Nothing Then
        Set objHtmlfile = CreateObject("htmlfile")
        objHtmlfile.parentWindow.execScript "function encode(s) {return encodeURIComponent(s)}", "jscript"
    End If
    EncodeUriComponent = objHtmlfile.parentWindow.encode(strText)
End Function

Function GetInnerText(strText As String) As String
    Static objHtmlfile As Object

    If objHtmlfile Is Nothing Then
        Set objHtmlfile = CreateObject("htmlfile")
        objHtmlfile.Open
        objHtmlfile.Write "<body></body>"
    End If
    objHtmlfile.body.innerHTML = strText
    GetInnerText = objHtmlfile.body.innerText
End Function

Output in the IE 11 browser:

[screenshot of the output]

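You can try running it yourself to check the results with a large amount of data.

If you run into any performance issues, I suggest trying it with a smaller amount of data.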
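
As for run-time error 91 in the XMLHTTP code from the question: one possible cause, which I have not verified against your row 714, is that the response for that keyword contains no `rso` block or no `h3`/`cite` element (for example, a page with no results or a captcha page), so the object variable is left as Nothing and reading `.innerText` fails. Below is a minimal sketch of a guard, assuming that is the cause. `FirstResultText` is a hypothetical helper name and is not part of either sample.

```vba
' Hypothetical helper (not from the original samples).
' Returns the innerText of the first <tagName> element inside #rso,
' or an empty string when either element is missing.
' Assumption: the late-bound "htmlfile" document returns Nothing
' for an id that does not exist in the response.
Function FirstResultText(ByVal html As Object, ByVal tagName As String) As String
    Dim objResultDiv As Object
    Dim objItems As Object

    Set objResultDiv = html.getElementById("rso")
    If objResultDiv Is Nothing Then Exit Function ' no results block in this response

    Set objItems = objResultDiv.getElementsByTagName(tagName)
    If objItems.Length = 0 Then Exit Function ' no matching element inside #rso

    FirstResultText = objItems(0).innerText
End Function
```

In the question's loop, the two assignments could then become `Cells(i, 2).Value = FirstResultText(html, "h3")` and `Cells(i, 3).Value = FirstResultText(html, "cite")`, so a keyword with no usable result leaves its cells blank instead of stopping the macro.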