VBA HTML列表信息提取

时间:2018-12-28 21:53:29

标签: excel vba excel-vba

我希望遵循A列中的一系列URL(例如:https://www.ebay.com/itm/Apple-iPhone-7-GSM-Unlocked-Verizon-AT-T-TMobile-Sprint-32GB-128GB-256GB/352381131997?epid=225303158&hash=item520b8d5cdd:m:mWgYDe4a79NeLuAlV-RmAQA:rk:7:pf:0),并从中获取以下信息: -标题 - 价钱 -说明

我认为我的代码有多个问题...对于一个问题,我无法使程序遵循Excel中列出的特定URL(仅当我在代码中指定一个问题时)。另外,拉多个字段也给我带来了问题。

Option Explicit
Public Sub ListingInfo()
Dim ie As New InternetExplorer, ws As Worksheet, t As Date
Dim i As Integer
i = 0

Do While Worksheets("Sheet1").Cells(i, 1).Value <> ""
Const MAX_WAIT_SEC As Long = 10
Set ws = ThisWorkbook.Worksheets("Sheet1")
With ie
    .Visible = True
    .Navigate2 Worksheets("Sheet1").Cells(i, 1).Value

    While .Busy Or .readyState < 4: DoEvents: Wend

    Dim Links As Object, i As Long, count As Long
    t = Timer
    Do
        On Error Resume Next
        Set Title = .document.querySelectorAll("it-ttl")
        Set price = .document.querySelectorAll("notranslate")
        Set Description = .document.querySelectorAll("ds_div")
        count = Links.Length
        On Error GoTo 0
        If Timer - t > MAX_WAIT_SEC Then Exit Do
    Loop While count = 0
    For i = 0 To Title.Length - 1
        ws.Cells(i + 1, 1) = Title.item(i)
        ws.Cells(i + 1, 2) = price.item(i)
        ws.Cells(i + 1, 3) = Description.item(i)
    Next
    .Quit
i = i + 1
Loop
End With
End Sub

3 个答案:

答案 0 :(得分:2)

您的代码中有很多事情要修复。这里已经很晚了,所以我将在下面给出指针(并在以后进行全面更新)和工作代码:

  1. 声明所有变量并使用适当的类型
  2. 查看For Loops以及如何使用转置来创建从工作表中拉出以循环播放的一维网址数组
  3. 查看querySelector和querySelectorAll方法之间的区别
  4. Review CSS selectors(您实际上将所有内容指定为类型选择器,而实际上您不是在通过标签选择感兴趣的元素;也不是在通过所声明的文本来选择)
  5. 考虑放置IE对象创建和.Navigate2以利用现有对象
  6. 确保使用不同的循环计数器
  7. 确保不要覆盖工作表中的值

代码:

Option Explicit
Public Sub ListingInfo()
    Dim ie As New InternetExplorer, ws As Worksheet
    Dim i As Long, urls(), rowCounter As Long
    Dim title As Object, price As Object, description As Object
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    urls = Application.Transpose(ws.Range("A1:A2").Value) '<= Adjust
    With ie
        .Visible = True
        For i = LBound(urls) To UBound(urls)
            If InStr(urls(i), "http") > 0 Then
                rowCounter = rowCounter + 1
                .Navigate2 urls(i)
                While .Busy Or .readyState < 4: DoEvents: Wend
                Set title = .document.querySelector(".it-ttl")
                Set price = .document.querySelector("#prcIsum")
                Set description = .document.querySelector("#viTabs_0_is")

                ws.Cells(rowCounter, 3) = title.innerText
                ws.Cells(rowCounter, 4) = price.innerText
                ws.Cells(rowCounter, 5) = description.innerText
                Set title = Nothing: Set price = Nothing: Set description = Nothing
            End If
        Next
        .Quit
    End With
End Sub

答案 1 :(得分:2)

我将对MSXML2.XMLHTTP使用后期绑定,并为HTMLDocument设置对Microsoft HTML对象库的引用。

注意:querySelector()引用找到的第一个与其搜索字符串匹配的项目。

这是简短的版本:

Public Sub ListingInfo()
    Dim cell As Range
    With ThisWorkbook.Worksheets("Sheet1")
        For Each cell In .Range("A1", .Cells(.Rows.Count, 1).End(xlUp))
            Dim Document As MSHTML.HTMLDocument
            With CreateObject("MSXML2.XMLHTTP")
                .Open "GET", cell.Value, False
                .send
                Set Document = New MSHTML.HTMLDocument
                Document.body.innerHTML = .responseText
            End With
            cell.Offset(0, 1).Value = Document.getElementByID("itemTitle").innerText
            cell.Offset(0, 2).Value = Document.getElementByID("prcIsum").innerText

            If Not Document.querySelector(".viSNotesCnt") Is Nothing Then
                cell.Offset(0, 3).Value = Document.querySelector(".viSNotesCnt").innerText
            Else
                'Try Something Else
            End If
        Next
    End With
End Sub

一种更精细的解决方案是将代码分解为较小的例程,然后将数据加载到Array中。这样的主要优点是您可以分别测试每个子例程。

Option Explicit
Public Type tListingInfo
    Description As String
    Price As Currency
    Title As String
End Type

Public Sub ListingInfo()
    Dim source As Range
    Dim data As Variant
    With ThisWorkbook.Worksheets("Sheet1")
        Set source = .Range("A1:D1", .Cells(.Rows.count, 1).End(xlUp))
        data = source.Value
    End With
    Dim r As Long
    Dim record As tListingInfo
    Dim url As String

    For r = 1 To UBound(data)
        record = getListingInfo()
        url = data(r, 1)
        record = getListingInfo(url)
        With record
            data(r, 2) = .Description
            data(r, 3) = .Price
            data(r, 4) = .Title
        End With
    Next
    source.Value = data
End Sub

Public Function getListingInfo(url As String) As tListingInfo
    Dim ListingInfo As tListingInfo
    Dim Document As MSHTML.HTMLDocument
    Set Document = getHTMLDocument(url)

    With ListingInfo
        .Description = Document.getElementByID("itemTitle").innerText
        .Price = Split(Document.getElementByID("prcIsum").innerText)(1)
        .Title = Document.querySelectorAll(".viSNotesCnt")(0).innerText
        Debug.Print .Description, .Price, .Title
    End With
End Function

Public Function getHTMLDocument(url As String) As MSHTML.HTMLDocument
    Const READYSTATE_COMPLETE As Long = 4

    Dim Document As MSHTML.HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .send
        If .readyState = READYSTATE_COMPLETE And .Status = 200 Then
            Set Document = New MSHTML.HTMLDocument
            Document.body.innerHTML = .responseText
            Set getHTMLDocument = Document
        Else
            MsgBox "URL:  " & vbCrLf & "Ready state: " & .readyState & vbCrLf & "HTTP request status: " & .Status, vbInformation, "URL Not Responding"
        End If
    End With
End Function

答案 2 :(得分:1)

这是使用Web请求和MSXML的方法。它应该比使用IE快得多,并且我鼓励您在可能的情况下强烈考虑使用此方法。

您需要对Microsoft HTML对象库和Microsoft XML v6.0的引用才能使其正常工作。

Option Explicit

Public Sub SubmitRequest()
    Dim URLs                              As Excel.Range
    Dim URL                               As Excel.Range
    Dim LastRow                           As Long
    Dim wb                                As Excel.Workbook: Set wb = ThisWorkbook
    Dim ws                                As Excel.Worksheet: Set ws = wb.Worksheets(1)
    Dim ListingDetail                     As Variant
    Dim i                                 As Long
    Dim j                                 As Long
    Dim html                              As HTMLDocument

    ReDim ListingDetail(0 To 2, 0 To 10000)

    'Get URLs
    With ws
        LastRow = .Cells(.Rows.Count, 1).End(xlUp).Row
        Set URLs = .Range(.Cells(1, 1), .Cells(LastRow, 1))
    End With

    'Update the ListingDetail
    For Each URL In URLs
        Set html = getHTML(URL.Value2)
        ListingDetail(0, i) = html.getElementByID("itemTitle").innertext 'Title
        ListingDetail(1, i) = html.getElementByID("prcIsum").innertext 'Price
        ListingDetail(2, i) = html.getElementsByClassName("viSNotesCnt")(0).innertext 'Seller Notes
        i = i + 1
    Next

    'Resize array
    ReDim Preserve ListingDetail(0 To 2, 0 To i - 1)

    'Dump in Column T,U,V of existing sheet
    ws.Range("T1:V" & i).Value = WorksheetFunction.Transpose(ListingDetail)
End Sub

Private Function getHTML(ByVal URL As String) As HTMLDocument
    'Add a reference to Microsoft HTML Object Library
    Set getHTML = New HTMLDocument

    With New MSXML2.XMLHTTP60
        .Open "GET", URL
        .send
        getHTML.body.innerHTML = .responseText
    End With
End Function