无法摆脱脚本中的硬编码延迟

时间:2018-05-18 21:07:14

标签: vba excel-vba selenium web-scraping selenium-chromedriver

我在vba中结合selenium编写了一个脚本来解析网页中可用的所有公司名称。该网页具有活动的延迟加载方法,因此每个滚动中只有20个链接可见。如果我滚动2次,则可见的链接数为40,依此类推。该网页中有1000个链接。我的下面的脚本可以到达该页面的底部,处理所有滚动并获取该网页中可用的所有名称。

但是,必须在每次滚动该网页后等待一段时间才能更新内容。这是我使用hardcoded delay的地方,但硬编码的过程非常不一致,有时它会使浏览器在整个操作完成之前退出。

如何修改此部分.Wait 6000以使其Explicit Wait而不是Hardcoded Wait

这是我到目前为止所写的:

Sub Getlinks()
    Dim driver As New ChromeDriver, prevlen&, curlen&
    Dim posts As Object, post As Object

    With driver
        .get "http://fortune.com/fortune500/list/"
        prevlen = .FindElementsByClass("company-title").Count

        Do
            prevlen = curlen
            .ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")

            .Wait 6000  ''I like to kick out this hardcoded delay and use explicit wait in place

            Set posts = .FindElementsByClass("company-title")
            curlen = posts.Count
            If prevlen = curlen Then Exit Do
        Loop

        For Each post In posts
            R = R + 1: Cells(R, 1) = post.Text
        Next post
    End With
End Sub

6 个答案:

答案 0 :(得分:4)

这是一种完全不同的方法,不需要使用浏览器,而是提交一系列Web请求。使用这种方法,等待页面加载不是一个问题。

通常,对于延迟加载页面,它会在您滚动时提交一个新请求以加载页面的数据。如果您监控网络流量,您可以发现所提出的请求并模拟这些请求,我已在下面进行了此操作。

结果应该是公司名称列表,按照第一张Excel的升序排列。

您需要的东西:

将引用添加到:

  • Microsoft Scripting Runtime
  • Microsoft XML v6.0
  • 将VBA-JSON代码添加到项目中。你可以找到here

修改

更改代码以继续从网站提取数据,直到列表中没有其他项目为止。感谢@Qharr指出这一点。

<强>代码

Public Sub SubmitRequest()
    Const baseURL As String = "http://fortune.com/api/v2/list/2358051/expand/item/ranking/asc/"

    Dim Url            As String
    Dim startingNumber As Long
    Dim j              As Long
    Dim getRequest     As MSXML2.XMLHTTP60
    Dim Json           As Object
    Dim Companies      As Object
    Dim Company        As Variant
    Dim CompanyArray   As Variant

    'Create an array to hold each company
    ReDim CompanyArray(0 To 50000)
    'Create a new XMLHTTP object so we can place a get request
    Set getRequest = New MSXML2.XMLHTTP60

    'The api seems to only support returning 100 records at a time
    'So do in batches of 100
    Do
        'Build the url, the format is something like
        '0/100, where 0 is the starting position, and 100 is the ending position
        Url = baseURL & startingNumber & "/" & startingNumber + 100

        With getRequest
            .Open "GET", Url
            .send

            'The response is a JSON object, for this code to work -
            'You'll need this code https://github.com/VBA-tools/VBA-JSON
            'What is returned is a dictionary
            Set Json = JsonConverter.ParseJson(.responseText)
            Set Companies = Json("list-items")

            'Keep checking in batches of 100 until there are no more
            If Companies.Count = 0 Then Exit Do

            'Iterate the dictionary and return the title (which is the name)
            For Each Company In Companies
                CompanyArray(j) = Company("title")
                j = j + 1
            Next

        End With
        startingNumber = startingNumber + 100
   Loop

    ReDim Preserve CompanyArray(j - 1)

    'Dump the data to the first sheet
    ThisWorkbook.Sheets(1).Range("A1:A" & j) = WorksheetFunction.Transpose(CompanyArray)

End Sub

答案 1 :(得分:3)

你去了:

Sub Getlinks()
    Dim driver As New ChromeDriver
    Dim pcount As Long, R as long
    Dim posts As Object, post As Object

    With driver
        .get "http://fortune.com/fortune500/list/"
        Do
            .ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
            Set posts = .FindElementsByClass("company-title")
            pcount = posts.Count
        Loop Until pcount = 1000

        For Each post In posts
            R = R + 1: Cells(R, 1) = post.Text
        Next post
    End With
End Sub

甚至更好,随时打印:

Sub Getlinksasyougo()
    Dim driver As New ChromeDriver
    Dim pcount As Long, R As Long, i As Long
    Dim posts As Object, post As Object


    With driver
        .get "http://fortune.com/fortune500/list/"
        i = 1
        Do
            .ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
            Set posts = .FindElementsByClass("company-title")
            pcount = posts.Count
            If i <> pcount Then
                For R = i To pcount - 1
                    Cells(R, 1) = posts(R + 1).Text
                Next R
                i = pcount
            End If
        Loop Until pcount = 1000

    End With
End Sub

答案 2 :(得分:2)

这是一种使用其中一条评论中讨论的“查找微调元素”方法来处理它的方法,这有助于您避免必须指定您希望页面加载的元素数量。微调器的类名实际上会根据它是否可见而改变,这使得在获取页面元素之前等待微调器变得可见+再次消失非常容易。

这种方法还需要一些等待;默认情况下,它会在每次尝试查找微调器后等待1/10秒,直到找到微调器或尝试最大次数。但这比每次等待5秒要快得多。

另外,不相关,但不要一次一个地写东西,它真的很慢。首先将它写入数组并一次写入整个数组要快得多。

Sub getLinks()

    Dim bot As New ChromeDriver
    bot.Get "http://fortune.com/fortune500/list/"

    Dim posts As WebElements
    Dim numPosts As Long
    Dim finishedScrolling As Boolean
    finishedScrolling = False
    Do Until finishedScrolling
        'Set beginning post count and scroll down
        Dim startPosts As Long
        startPosts = numPosts
        bot.ExecuteScript "window.scrollTo(0, document.body.scrollHeight);"

        'Wait for spinner to become visible, then wait for up to 5 seconds for rehide
        Call waitForElements(bot, "div[class^='F500-spinner  ']", 50)
        Call waitForElements(bot, "div[class^='F500-spinner hide']", 50)

        'See if any new posts have loaded
        Set posts = bot.FindElementsByClass("company-title")
        numPosts = posts.Count
        If numPosts = startPosts Then
            finishedScrolling = True
        End If
    Loop

    'Write text to results array
    Dim post As WebElement
    ReDim resultsArr(1 To posts.Count, 1 To 1) As String
    Dim i As Long
    i = 1
    For Each post In posts
        resultsArr(i, 1) = post.Text
        i = i + 1
    Next

    'Write array to sheet
    With ActiveSheet
        .Range(.Cells(1, 1), .Cells(UBound(resultsArr, 1), 1)).Value = resultsArr
    End With

End Sub
Sub waitForElements(bot As WebDriver, css As String, maxAttempts As Long, Optional waitTimeMS As Long = 100)
'Use a CSS selector string to wait for element(s) to appear on a page or to reach max number of attempts
'By default, bot waits 0.1 second after each attempt

    Dim i As Long
    Dim foundElem As Boolean
    foundElem = False
    Do Until foundElem
        i = i + 1
        If bot.FindElementsByCss(css).Count > 0 Then
            foundElem = True
        ElseIf i = maxAttempts Then
            foundElem = True
        Else
            bot.Wait waitTimeMS
        End If
    Loop

End Sub

答案 3 :(得分:1)

我不知道这是否有用,因为它仍然是一个“硬编码”的解决方案,但您可以尝试延迟功能而不是等待功能,看看这是否有助于程序退出问题。

#include <stdio.h>
#include <string.h>

void create_numbers(int number){
int i;
int rotation_count=0;
int array[number];
int 

    for (i=1; i<= number; i++){
        array[i] = i;
        if (i == number){
            for (i=number-1; i >=1; i--) {
                printf("%d", i);
                if (i == 1) {
                    printf("\n");
                    return ;
                }
            }
        }
    }
}

void main()
{
    int a = 1, i, remainder, rotate, max_rotation;
    while (a == 1){
        int number; 
        printf("give me a number: \n");
        scanf("%d", &number);
        remainder = number%2;
        if (remainder != 0) {
            for (i=1; i<number; i++){
                create_numbers(number);
            }
        }
    }   
}   

答案 4 :(得分:1)

定义超时(允许经过的指定时间段)以消除硬编码延迟。超时需要硬编码。

此代码与原始代码之间存在差异:

  • 循环本身一遍又一遍地运行(每次迭代都不会等待6秒)并检查新内容,直到找到新内容或达到超时。
  • 如果延迟加载花费的时间超过预期,例如加载数字21到50时,循环等等#34;等等。并尝试获取超时中定义的最长时间的新内容。
  • 下行:在加载所有内容的最后一步,循环将花费设置超时的秒数。

代码:

Sub Getlinks()
    Dim driver As New ChromeDriver, prevlen&, curlen&
    Dim posts As Object, post As Object
    Dim timeout As Integer, startTime As Double

    timeout = 10 ' set the timeout to 10 seconds

    With driver
        .get "http://fortune.com/fortune500/list/"
        prevlen = .FindElementsByClass("company-title").Count

        startTime = Timer ' set the initial starting time

        Do
            .ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
            Set posts = .FindElementsByClass("company-title")
            curlen = posts.Count
            If curlen > prevlen Then
                startTime = Timer ' reset start time if new elements found
                prevlen = curlen ' set new prevlen
            End If
        Loop While Round(Timer - startTime, 2) <= timeout ' check if timeout is reached

        For Each post In posts
            R = R + 1: Cells(R, 1) = post.Text
        Next post
    End With
End Sub

答案 5 :(得分:0)

我想你差点就到了。

虽然我不认为你可以避免等待,但是当你向下滚动等待较短的时间时,解决方法是多次检查新帖子。

以下示例是每次检查新帖子5次,等待2秒,因此在宣布页面结束之前总共需要10秒。调整这两个参数以适应。

Sub Getlinks()
    Dim driver As New ChromeDriver, prevlen&, curlen&
    Dim posts As Object, post As Object
    ' Counter for number of times when there are NO NEW POSTS
    Dim NoIncreaseCount As Integer
    Const MaxNoIncreaseCount As Integer = 5
    Const WaitTime As Integer = 2000 ' 2 seconds wait time each scroll down

    With driver
        .get "http://fortune.com/fortune500/list/"
        prevlen = .FindElementsByClass("company-title").Count
        NoIncreaseCount = 0
        Do Until NoIncreaseCount = MaxNoIncreaseCount
            .ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
            .Wait WaitTime
            Set posts = .FindElementsByClass("company-title")
            curlen = posts.Count
            If prevlen < curlen Then
                ' There are new Posts
                prevlen = curlen
                NoIncreaseCount = 0
            Else
                ' No new Posts
                NoIncreaseCount = NoIncreaseCount + 1
            End If
        Loop

        For Each post In posts
            R = R + 1: Cells(R, 1) = post.Text
        Next post
    End With
End Sub