VBA Code Scraper不在右列中放置数据

时间:2019-05-25 12:33:37

标签: excel vba

该代码可以正常工作,但是我只需要提取电子邮件和URL并将其放置在Sheet1“ Scraper” NEXT BLANK ROW

Emails =  Column A
Urls =  Column B

当前,它提取任何文本,电子邮件或URL,并将它们放置在column AB中。

我只需要电子邮件或URL。我已经坚持了一段时间,似乎无法解决

我也不确定我的DELETE DUPLICATES是否要删除重复的行或列中的重复项。它应该是重复的行。

代码的工作方式:

Sheet2“ URL列表”上,我有一个URL列表,代码通过它运行并将结果放置在Sheet1“ Scraper”上。并删除所有重复项

仅应抓取电子邮件和URL,并将它们放置在NEXT BLANK ROW上的Column AB中。

我试图解决此问题,但超出了我的范围。

Private Sub fbStart_Click()
'Set sheet2 URL List and open Internet Explorer
    Dim lr          As Long
    Dim x           As Long
    Dim arr()       As Variant
    Dim wks         As Worksheet
    Dim ie          As Object
    Dim dd(1 To 2)  As String
    Dim Fr          As Long

    On Error Resume Next
    Application.ScreenUpdating = False

    Set wks = ThisWorkbook.Sheets("Url List")
    With wks
        Fr = .Cells(.Rows.Count, 6).End(xlUp).Offset(1).Row
        lr = .Cells(.Rows.Count, 1).End(xlUp).Row
        .Cells(1, 5).Value = lr
        arr = .Range(.Cells(Fr, 1), .Cells(lr, 1)).Value
    End With

    'Show Internet Explorer and add delay in seconds if needed
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .Visible = True
        Application.Wait Now + TimeValue("0:00:0")

        For x = LBound(arr, 1) To UBound(arr, 1)
            .navigate arr(x, 1)
            wtime = Time
            Do While .Busy Or .readyState <> 4
                DoEvents

            'Skip pages with Captchas  + write the word Captcha in Sheet 2 Column C
                If Time > (wtime + TimeValue("00:00:10")) Then
                    Cells(x + 1, "C").Value = "Captcha"
                    Exit Do
                End If
            Loop

            On Error Resume Next
            'Variable for document or data which need to be extracted out of webpage, change innertext number if same class used
            Dim doc As HTMLDocument
            Set doc = ie.document
            dd(1) = doc.getElementsByClassName("_50f4")(2).innerText
            dd(2) = doc.getElementsByClassName("_50f4")(3).innerText


            'Paste the web data into Sheet1 "Scraper" in next BLANK ROW
            With Sheet1
                .Cells(.Rows.Count, "A").End(xlUp).Offset(1, 0).Resize(, 2).Value = dd
            End With

            ' Put A number 1 in Sheet2 "Url List"column B to notify this URL is done
            Sheets("Url List").Range("B" & Rows.Count).End(xlUp).Offset(1, 0).Value = 1

             'Deletes duplicates in column A Sheet1
            Columns(1).RemoveDuplicates Columns:=Array(1)
            Columns(2).RemoveDuplicates Columns:=Array(1)

             'Count No1 in sheet2 Column B
            With Worksheets("Url List")
                Lastrow = .Cells(.Rows.Count, "B").End(xlUp).Row
                Sheets("Url List").Range("B1").Value = Lastrow
            End With
            Call Autoclick_Click
        Next x
       .Quit
    End With

    'Hide FaceBook Scraper Form
    ScraperForm.Hide

End Sub

1 个答案:

答案 0 :(得分:4)

下面是向您展示如何处理查找电子邮件和网站地址的方法。您已经拥有循环并已进行重复数据删除。以下是用于提取所需信息的帮助方法。您可以简单地将变量emailwebsite分配给循环中的单元格。我展示了一种使用辅助函数来确定目标表中的lastRow并一次性写入变量以校正列的方法。

如果需要,我可以帮助实现循环集成,但是这里的重点是解释如何识别感兴趣的元素以及如何写出正确的列。 TBH-重复数据删除很容易在工作表的末尾完成,但是您也可以使用宏记录器为该单步操作/使用现有的SO答案获得完美的功能代码。


tl; dr;

如果允许:contains / :has css伪类,这会容易得多。相反,我的方法如下:

  1. 电子邮件-查找其值以href开头的mailto属性

image

  1. 网站-检查页面上是否有网站图标

enter image description here

同时指定网站图标和网站地址的父项

enter image description here

查看与该父规范的所有匹配项,检查是否包含网站图标(这是伪类选择器将简化的地方)。如果找到匹配项,那么我们将拥有图标和希望的网站地址的共享父级;使用childOfSiblingCssSelector(在这种情况下,我们正在寻找以下div的子级)css选择器提取网站网址。


注释:

  1. 整个过程保持了较高的级别/通用性,因此您可以调整css选择器,以期满足不同的情况。结果-似乎有些冗长。
  2. 提供了
  3. Helper函数来处理元素匹配。用一种有意义的方式命名它们。我认为这里还有待改进。
  4. 从技术上讲,第二个助手GetText可用于提取电子邮件地址(我可能会在函数调用中添加另一个参数以指定要提取的属性)以及网站地址,但看起来要快得多(目前,仅定位到如上所述的适当的href
  5. 我将css选择器作为接近其用法的局部变量保存;您可以将它们作为常量,更靠近模块顶部,也许更容易访问?不确定是否会随着时间的推移/不同的URL表现如何
  6. .getElementsBy方法中选择
  7. Css选择器有两个原因:1)css选择器具有浏览器优化功能,因此,如果格式正确,css会更快2)我想保留代码/帮助程序的灵活性功能-对于CSS选择器,您可以表达什么模式有更多的特异性。我认为这很重要,因为我不知道您将来可能需要处理哪些案件。
  8. 我故意不使用类名和索引,例如doc.getElementsByClassName("_50f4")(2).innerText,因为我不熟悉潜在的用例范围;这种感觉有点脆弱,因为它依赖于元素的一致排序和编号(至少直到这些索引为止)。

待办事项:

  1. 与每次在HTMLDocument中实例化一个新的GetText相比,在函数签名中(即从调用过程中)传递另一个HTMLDocument自变量更为有效。重构可以考虑到这一点。
  2. 这种类型的编码将来可能会成为基于类的。尤其是如果要添加错误处理和进一步的功能。

VBA:

Option Explicit
'VBE > Tools > References > HTML Object Library
Public Sub test()
    Dim ie As Object, ws As Worksheet
    Set ie = CreateObject("InternetExplorer.Application")
    Set ws = ThisWorkbook.Worksheets("Scraper")

    With ie
        .Visible = True
        .Navigate2 "https://www.facebook.com/pg/SalemFordNH/about/?ref=page_internal%5Blink%5D"

        While .Busy Or .readyState < 4: DoEvents: Wend

        With .document

            Dim email As String, website As String, iconCssSelector As String
            'iconCssSelector for website icon in this instance
            iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"

            If ElementIsPresent(ie.document, "[href^=mailto]") Then
                email = ie.document.querySelector("[href^=mailto]").innerText
            Else
                email = "Not found"
            End If

            Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
            sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
            childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent

            If ElementIsPresent(ie.document, iconCssSelector) _
                And ElementIsPresent(ie.document, sharedParentCssSelector) Then

                Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
                website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
            Else
                website = "Not found"
            End If
        End With
        'Assumes headers already present
        Dim nextRow As Long
        nextRow = GetLastRow(ws, 1) + 1
        ws.Cells(nextRow, 1).Resize(1, 2) = Array(email, website)
        .Quit
    End With
End Sub

Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
    ElementIsPresent = document.querySelectorAll(cssSelector).length > 0
End Function

Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
    'in this instance and with microsoft IE DOM you cannot select for parent of an element with pseudo class _
    of :has(>child); nor use :contains... instead pass expected parent selector, that houses _
    both the icon element for website and the website address itself, and loop all matches checking for website icon _
    if found use childOfSiblingCssSelector to extract
    Dim i As Long, html As HTMLDocument
    Set html = New HTMLDocument

    For i = 0 To parents.length - 1
        html.body.innerHTML = parents.item(i).innerHTML
        If ElementIsPresent(html, iconCssSelector) Then
            GetText = html.querySelector(childOfSiblingCssSelector).innerText
            Exit Function
        End If
    Next
    GetText = "Not found"
End Function

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
    End With
End Function

项目参考(VBE>工具>参考):

  1. Microsoft HTML对象库

其他阅读内容:

  1. https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
  2. https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll
  3. https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector

编辑:

循环示例-假设网址之间A列中没有空行。

Option Explicit
'VBE > Tools > References > HTML Object Library
Public Sub test()
    Dim ie As Object, ws As Worksheet, wsUrls As Worksheet, urls()
    Set ie = CreateObject("InternetExplorer.Application")
    Set ws = ThisWorkbook.Worksheets("Scraper")
    Set wsUrls = ThisWorkbook.Worksheets("Url List")

    With wsUrls
        urls = Application.Transpose(.Range("A2:A" & .Cells(.rows.Count, "A").End(xlUp).Row).Value)
    End With
    Dim results(), r As Long
    ReDim results(1 To UBound(urls), 1 To 2)

    With ie
        .Visible = True

        For r = LBound(urls) To UBound(urls)
            .Navigate2 urls(r)

            While .Busy Or .readyState < 4: DoEvents: Wend

            With .document

                Dim email As String, website As String, iconCssSelector As String
                'iconCssSelector for website icon in this instance
                iconCssSelector = "[src='https://static.xx.fbcdn.net/rsrc.php/v3/yV/r/EaDvTjOwxIV.png']"

                If ElementIsPresent(ie.document, "[href^=mailto]") Then
                    email = ie.document.querySelector("[href^=mailto]").innerText
                Else
                    email = "Not found"
                End If

                Dim parents As Object, sharedParentCssSelector As String, childOfSiblingCssSelector As String
                sharedParentCssSelector = "._5aj7" 'target parent of both icon and the website link
                childOfSiblingCssSelector = "._50f4" '< to target website address after finding right parent

                If ElementIsPresent(ie.document, iconCssSelector) _
        And ElementIsPresent(ie.document, sharedParentCssSelector) Then

                    Set parents = ie.document.querySelectorAll(sharedParentCssSelector) 'css selector used to allow for greater flexibility in element matching
                    website = GetText(ie.document, parents, iconCssSelector, childOfSiblingCssSelector)
                Else
                    website = "Not found"
                End If
            End With
            'Assumes headers already present
            Dim nextRow As Long
            results(r, 1) = email
            results(r, 2) = website
        Next
        .Quit
    End With
    nextRow = GetLastRow(ws, 1) + 1
    ws.Cells(nextRow, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub

Public Function ElementIsPresent(ByVal document As HTMLDocument, ByVal cssSelector As String) As Boolean
    ElementIsPresent = document.querySelectorAll(cssSelector).length > 0
End Function

Public Function GetText(ByVal document As HTMLDocument, ByVal parents As Object, ByVal iconCssSelector As String, ByVal childOfSiblingCssSelector As String) As String
    'in this instance and with microsoft IE DOM you cannot select for parent of an element with pseudo class _
    of :has(>child); nor use :contains... instead pass expected parent selector, that houses _
    both the icon element for website and the website address itself, and loop all matches checking for website icon _
    if found use childOfSiblingCssSelector to extract
    Dim i As Long, html As HTMLDocument
    Set html = New HTMLDocument

    For i = 0 To parents.length - 1
        html.body.innerHTML = parents.item(i).innerHTML
        If ElementIsPresent(html, iconCssSelector) Then
            GetText = html.querySelector(childOfSiblingCssSelector).innerText
            Exit Function
        End If
    Next
    GetText = "Not found"
End Function

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
    End With
End Function