Question

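I've written a script in VBA with Selenium to scroll to the bottom of a lazy-loading web page, and it does work. However, the For x loop I use in the script looks odd and I have no justification for it. What I'd like is a loop that does the same job without a hardcoded number, such as the 200 used here. Any help with this would be highly appreciated.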

Sub Get_links()
Dim driver As New WebDriver

With driver
    .Start "chrome", "http://fortune.com/fortune500"
    .get "/list/"
End With

For x = 0 To 200
    driver.ExecuteScript "window.scrollTo(0, document.body.scrollHeight);"
    driver.Wait 500
Next x
End Sub

Answer 1

To be honest, I really enjoy solving and tweaking your questions; they are really challenging. Here you go:

Sub Get_links()
Dim driver As New WebDriver
Dim CurrentPageHeight As Long, PrevPageHeight As Long
Dim EndofPage As Boolean

'EndofPage = False
With driver
    .Start "chrome", "http://fortune.com/fortune500"
    .get "/list/"
End With

Do While EndofPage = False
    PrevPageHeight = CurrentPageHeight
    CurrentPageHeight = driver.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);var CurrentPageHeight=document.body.scrollHeight;return CurrentPageHeight;")
    driver.Wait 3000 'depending on your internet connection, increase or decrease time
    If PrevPageHeight = CurrentPageHeight Then
        EndofPage = True
    End If
Loop

End Sub

Edit

I don't think there is an implicit or explicit wait for Selenium in VBA, and I don't think one is needed.

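When scraping a web page, with or without Selenium, I always prefer to rely on whether an element exists in the page. In my personal experience, "implicit and explicit waits" have failed me in both Python and VBA when scraping.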
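As a rough illustration of relying on element existence, the sketch below polls until at least one element with a given class is present or a timeout expires. WaitForClass and its parameters are hypothetical names that are not part of the original answer, and the sketch assumes the SeleniumBasic WebDriver used elsewhere in this thread, where FindElementsByClass returns a collection with a Count property.

```vba
' Hypothetical helper, not from the original answer.
' Assumes SeleniumBasic: FindElementsByClass returns a collection with .Count.
Function WaitForClass(ByVal driver As WebDriver, ByVal ClassName As String, _
                      Optional ByVal TimeoutSeconds As Single = 30) As Boolean
    Dim StartTime As Single
    StartTime = Timer   ' seconds since midnight; not reliable across midnight

    Do
        ' Succeed as soon as at least one matching element exists.
        If driver.FindElementsByClass(ClassName).Count > 0 Then
            WaitForClass = True
            Exit Function
        End If
        driver.Wait 250 ' pause briefly between polls
    Loop While Timer - StartTime < TimeoutSeconds

    ' Falls through with WaitForClass = False when the timeout expires.
End Function
```

A call such as WaitForClass(driver, "company-list") would return True once the list container is in the page, and False if it never appears within the timeout.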

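Again, personally, I find VBA more reliable and easier than Python, not only for scraping but also for pulling the data into Excel, since both live on the same platform. This is because I found a way to make sure I am scraping the page I want (not the page loaded in a previous iteration of a loop). Please check the solution mentioned in this post; I couldn't find anything like it online.

I can achieve the same thing in Python, but I would only do so if I intended to use the parsed data in an API. Since this is Excel, VBA is the better choice.

Anyway, I have mimicked an implicit wait for you below. I hope it gives you some insight into your comment/question.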

Sub Get_links()
Dim driver As New WebDriver
Dim CurrentPageHeight As Long, NextPageHeight As Long
Dim EndofPage As Boolean

'EndofPage = False
With driver
    .Start "chrome", "http://fortune.com/fortune500"
    .get "/list/"
End With

Do
    driver.ExecuteScript "window.scrollTo(0, document.body.scrollHeight);"
    On Error Resume Next
    Debug.Print Split(driver.FindElementsByClass("company-list")(1).Text, vbLf)(3001)
Loop Until Err.Number <> 9
End Sub

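Edit 2: The reason for using Debug.Print Split(driver.FindElementsByClass("company-list")(1).Text, vbLf)(3001) is to check whether an element that belongs to the bottom of the page exists. There is nothing special about this expression; you can use anything similar, as long as it returns an element from the bottom of the page. Let me explain my logic:

If you Debug.Print driver.FindElementsByClass("company-list")(1).Text, you will see that it is the complete list, separated by line feeds.

So I split the text on vbLf and pick rank 1000 in the list, which is element (3001). How do I know that? With some simple and quick logic: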

...(1).Text, vbLf)(0) -> RANK
...(1).Text, vbLf)(1) -> COMPANY
...(1).Text, vbLf)(2) -> REVENUES ($M)
...(1).Text, vbLf)(3) -> 1
...(1).Text, vbLf)(4) -> Walmart
...(1).Text, vbLf)(5) -> $485,873
...(1).Text, vbLf)(6) -> 2
.
.
(Rank 1) * 3 = (3)
(Rank 2) * 3 = (6)
.
.
.
(Rank 1000) * 3 = (3000)

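You would expect to get rank 1000 from (3000), but you don't, because there is another div after the 20th row of the list. So it is (3001). You can use 3000, 2950, 2912 or whatever you like, as long as the index falls within the last group of 50.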
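The index arithmetic above can be written down as a small function. This is only a sketch: SplitIndexForRank is a hypothetical name, and it assumes, as described above, three lines per row (rank, company, revenues), a three-line header, and exactly one extra line contributed by the div after row 20.

```vba
' Hypothetical helper, not from the original answer: maps a rank in the list
' to its position in the array returned by Split(text, vbLf).
Function SplitIndexForRank(ByVal Rank As Long) As Long
    Const FieldsPerRow As Long = 3        ' rank, company, revenues
    Const ExtraLineAfterRow As Long = 20  ' assumption: one extra line after row 20

    ' The header occupies indexes 0 to 2, so rank 1 starts at index 3.
    SplitIndexForRank = Rank * FieldsPerRow

    ' Every rank past row 20 is shifted by the one extra line.
    If Rank > ExtraLineAfterRow Then
        SplitIndexForRank = SplitIndexForRank + 1
    End If
End Function
```

Under these assumptions, SplitIndexForRank(2) evaluates to 6 and SplitIndexForRank(1000) evaluates to 3001, the index used in the loop.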

How to find the bottom of a webpage rectifying my existing loop?
