How do I get rid of duplicate links when parsing web data?

Asked: 2017-07-26 13:00:13

Tags: vba web-scraping web-crawler

I've written a script in VBA to parse the links to the next pages of a torrent site. My script is able to scrape them; however, the problem I'm facing is that several duplicate links appear in the results. My question is: is there a technique to parse only the unique links?

Sub TorrentData()
    Dim http As New XMLHTTP60, html As New HTMLDocument, post As Object
    Dim x As Long

    With http
        .Open "GET", "https://yts.ag/browse-movies", False
        .send
        html.body.innerHTML = .responseText
    End With

    ' Collect every pagination link -- duplicates included
    For Each post In html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
        If InStr(post.href, "page") > 0 Then
           x = x + 1: Cells(x, 1) = post.href
        End If
    Next post
End Sub

A partial screenshot of the scraped links:

(image omitted)

Before proceeding, be sure to check this link: "https://www.dropbox.com/s/647x3m65u90a1bu/Description1.txt?dl=0"

1 Answer:

Answer 0 (score: 1)

I couldn't get the site to work. In any case, the correct way to use a Dictionary to eliminate duplicates while writing to the cells inside the same loop should look like this:

Dim dict As Object
Set dict = CreateObject("Scripting.Dictionary")

For Each Post In html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
    If InStr(Post.href, "page") > 0 Then
        ' Only write the link if it hasn't been seen before
        If Not dict.Exists(Post.href) Then
            dict.Add Post.href, "whatever information you would like to store"
            x = x + 1
            Cells(x, 1) = Post.href
        End If
    End If
Next Post
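For reference, the same check-a-dictionary-before-writing idea can be sketched in Python (the function name and sample URLs below are illustrative, not from the original post):

```python
def unique_links(links):
    """Keep only links containing 'page', dropping duplicates but preserving order."""
    seen = {}      # plays the role of the Scripting.Dictionary in the VBA answer
    result = []
    for href in links:
        if "page" in href and href not in seen:
            seen[href] = True   # remember the link so later repeats are skipped
            result.append(href)
    return result

# Duplicate pagination links collapse to a single entry each:
print(unique_links([
    "https://yts.ag/browse-movies?page=2",
    "https://yts.ag/browse-movies?page=2",
    "https://yts.ag/browse-movies?page=3",
]))
# → ['https://yts.ag/browse-movies?page=2', 'https://yts.ag/browse-movies?page=3']
```

The key point in both versions is that membership is tested *before* the row counter is incremented, so the output cells stay contiguous.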