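I've written some code in VBA to get all the links leading to the next pages from a web page. The highest page number among those links is 255. When I run my script, I end up with 6906 links, which means the loop runs again and again and I keep collecting the same links. After filtering out the duplicates, I can see there are 254 unique links. My goal is to avoid hardcoding the highest page number into the link for the iteration. Here is what I'm trying: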
Sub YifyLink()
    Const link = "https://www.yify-torrent.org/search/1080p/"
    Dim http As New XMLHTTP60, html As New HTMLDocument, htm As New HTMLDocument
    Dim x As Long, y As Long, item_link As String

    With http
        .Open "GET", link, False
        .send
        html.body.innerHTML = .responseText
    End With

    For Each post In html.getElementsByClassName("pager")(0).getElementsByTagName("a")
        If InStr(post.innerText, "Last") Then
            x = Split(Split(post.href, "-")(1), "/")(0)
        End If
    Next post

    For y = 0 To x
        item_link = link & "t-" & y & "/"
        With http
            .Open "GET", item_link, False
            .send
            htm.body.innerHTML = .responseText
        End With

        For Each posts In htm.getElementsByClassName("pager")(0).getElementsByTagName("a")
            I = I + 1: Cells(I, 1) = posts.href
        Next posts
    Next y
End Sub
The element that contains the links:
<div class="pager"><a href="/search/1080p/" class="current">1</a> <a href="/search/1080p/t-2/">2</a> <a href="/search/1080p/t-3/">3</a> <a href="/search/1080p/t-4/">4</a> <a href="/search/1080p/t-5/">5</a> <a href="/search/1080p/t-6/">6</a> <a href="/search/1080p/t-7/">7</a> <a href="/search/1080p/t-8/">8</a> <a href="/search/1080p/t-9/">9</a> <a href="/search/1080p/t-10/">10</a> <a href="/search/1080p/t-11/">11</a> <a href="/search/1080p/t-12/">12</a> <a href="/search/1080p/t-13/">13</a> <a href="/search/1080p/t-14/">14</a> <a href="/search/1080p/t-15/">15</a> <a href="/search/1080p/t-16/">16</a> <a href="/search/1080p/t-17/">17</a> <a href="/search/1080p/t-18/">18</a> <a href="/search/1080p/t-19/">19</a> <a href="/search/1080p/t-20/">20</a> <a href="/search/1080p/t-21/">21</a> <a href="/search/1080p/t-22/">22</a> <a href="/search/1080p/t-23/">23</a> <a href="/search/1080p/t-2/">Next</a> <a href="/search/1080p/t-255/">Last</a> </div>
The results I'm getting (partial):
about:/search/1080p/t-20/
about:/search/1080p/t-21/
about:/search/1080p/t-22/
about:/search/1080p/t-23/
about:/search/1080p/t-255/
Answer 0 (score: 1)
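The idea is to scrape the pages in a loop, pick something to compare on each pass, and exit the loop as soon as that comparison no longer holds.
That could be checking a key against a dictionary, checking whether an element exists, or any other logic specific to your problem.
In your case, the problem is that the website keeps showing page 255 for every later page number. That is our clue: we can compare an element that belongs to page(n) with an element that belongs to page(n-1).
For example, if an element on page 256 is the same as the corresponding element on page 255, exit the loop/sub. Please see the sample code below: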
Sub yify()
    Const mlink = "https://www.yify-torrent.org/search/1080p/t-"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, posts As Object
    Dim pageno As Long, rowno As Long

    pageno = 1
    rowno = 1

    Do
        ' Fetch page number pageno synchronously.
        With http
            .Open "GET", mlink & pageno & "/", False
            .send
            html.body.innerHTML = .responseText
        End With

        Set posts = html.getElementsByClassName("mv")

        ' Stop once the item at index 17 on this page matches the last value
        ' written, i.e. the site has started repeating its final page.
        ' The index 17 is specific to this site's layout.
        If Cells(rowno, 1) = posts(17).getElementsByTagName("a")(0).innerText Then Exit Do

        ' Write the text of the first <div> of every item to column A.
        For Each post In posts
            With post.getElementsByTagName("div")
                If .Length Then
                    rowno = rowno + 1
                    Cells(rowno, 1) = .Item(0).innerText
                End If
            End With
        Next post

        Debug.Print "pageno: " & pageno & " completed."
        pageno = pageno + 1
    Loop
End Sub
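The dictionary check mentioned above can be sketched separately from the scraping loop. This is only a minimal illustration, not part of the original answer: `AddIfNew` and `DemoUniqueLinks` are hypothetical names, the dictionary is created late-bound through `CreateObject("Scripting.Dictionary")` (which assumes Windows), and the sample hrefs are copied from the question's output with one repeated on purpose.

```vba
Option Explicit

' Hypothetical helper: remembers key in seen and returns True
' only the first time that key is offered.
Function AddIfNew(ByVal seen As Object, ByVal key As String) As Boolean
    If seen.Exists(key) Then
        AddIfNew = False
    Else
        seen.Add key, True
        AddIfNew = True
    End If
End Function

Sub DemoUniqueLinks()
    Dim seen As Object, hrefs As Variant, h As Variant
    Dim rowno As Long

    ' Late binding, so no extra library reference is required.
    Set seen = CreateObject("Scripting.Dictionary")

    ' Sample data taken from the question's output; t-22 appears twice.
    hrefs = Array("about:/search/1080p/t-22/", "about:/search/1080p/t-23/", _
                  "about:/search/1080p/t-22/", "about:/search/1080p/t-255/")

    For Each h In hrefs
        ' Only write links that have not been seen before.
        If AddIfNew(seen, CStr(h)) Then
            rowno = rowno + 1
            Cells(rowno, 1) = h
        End If
    Next h

    Debug.Print seen.Count & " unique links"
End Sub
```

In the scraping loop, the same helper could wrap the line that writes `posts.href`, so each link lands on the sheet only once. Whether "no new keys on this page" is also a safe exit condition depends on how the site's pager behaves, so that part is left as an assumption to verify.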