VBA-从DIV中提取文本,其中没有CLASS,ID,TAG

时间:2017-11-15 14:42:55

标签: excel vba

这是脚本

<div style="padding-left:0em; margin:10px 25px 24px 0px;">
  <div style='word-wrap:break-word; max-height:50px; overflow:hidden;'>                 
    <a style="display:unset; font-size:1.25em;" href="http://www.aptcenter.research.va.gov/staff/investigators/zhu/">Hui Zhu, MD, ScD - Advanced Platform Technology <b>Center</b></a>
  </div>

  <div class="url" style='word-spacing:normal; word-wrap:break-word; color:#008933; margin-bottom:2px; margin-top:2px'>
        www.aptcenter.research.va.gov/&#8203;staff/&#8203;investigators/&#8203;zhu
  </div>
  <div style='color:#545454; line-height:1.3em; word-wrap:break-word; margin-bottom:2px;'>Jewish Hospital. Fellowship training took place in Urological Oncology at the <b>Memorial</b> <b>Sloan</b>-<b>Kettering</b> <b>Cancer</b> <b>Center</b> in New York <b>Center</b>. Dr. Zhu</div>
    <div class='no-print' style='margin-bottom:10px; margin-top:4px; color:#010101;'>                   
        View results from <a href="/search/va/va_search.jsp?QT=Memorial+Sloan+Kettering+Cancer+Center&amp;SQ=www.aptcenter.research.va.gov&amp;DB=3">APTCENTER RESEARCH</a>         
        or related <a href="/search/va/va_search.jsp?QT=Memorial+Sloan+Kettering+Cancer+Center&amp;DB=3">Health Research</a> sites
    </div>                  
  </div> 

我想获得标题,即&#34; Hui Zhu,MD,ScD - 高级平台技术中心&#34;然后在下面的文字链接,即&#34;犹太医院。在纽约中心纪念斯隆 - 凯特琳癌症中心的泌尿外科肿瘤学院进行了奖学金培训。朱博士&#34;

这是我目前的代码

Dim objIE As Object, Slink As String
Dim hUls As MSHTML.IHTMLElementCollection
Dim httpRequest As XMLHTTP
Dim divElem As HTMLDivElement

Set httpRequest = New XMLHTTP

Slink = "https://www.index.va.gov/search/va/va_search.jsp?QT=Memorial+Sloan+Kettering+Cancer+Center&RS=1"

With httpRequest
    .Open "GET", Slink, False
    .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
    .send
End With
With httpRequest
    While Not .readyState = 4
        Application.Wait Now + TimeValue("0:00:01")
    Wend
    If .Status = 200 Then
        While InStr(1, .responseText, "Updating", 0) > 0
            Application.Wait Now + TimeValue("0:00:01")
        Wend
        Set oHtml = New HTMLDocument
        oHtml.body.innerHTML = .responseText
    End If
End With

Set divElem = oHtml.getElementById("innerContent")
MsgBox divElem.getAttribute("padding-left:0em; margin:10px 25px 24px 0px;").cssText

1 个答案:

答案 0 :(得分:0)

经过一些调整,假设HTML没有真正改变,但这应该会帮助你。

它将在末尾返回一个数组,其中包含“链接”文本和描述该链接的“正文”文本。

删除Set divElem,然后删除...并替换为:

Dim i As Long
Dim title() As Variant
ReDim title(0)

Dim ele As Object

Set aElem = oHtml.getElementsByTagName("a")
For Each ele In aElem
    If ele.Style = "FONT-SIZE" Then ' Will return each of the search findings main info
        title(UBound(title)) = ele.innerText
        ReDim Preserve title(LBound(title) To UBound(title) + 1)
    End If
Next ele

Dim text() As Variant
ReDim text(0)

Set divElem = oHtml.getElementsByTagName("div")
For Each ele In divElem
    If ele.getAttribute("style").cssText = "WORD-WRAP: break-word; MARGIN-BOTTOM: 2px; COLOR: #545454; LINE-HEIGHT: 1.3em" Then
        text(UBound(text)) = ele.innerText
        ReDim Preserve text(LBound(text) To UBound(text) + 1)
    End If
Next ele

Dim combinedInfo() As Variant
ReDim combinedInfo(UBound(title), 1)
For i = LBound(title) To UBound(title)
    combinedInfo(i, 0) = title(i)
    combinedInfo(i, 1) = text(i)
Next i

For i = LBound(title) To UBound(title) - 1
    Debug.Print combinedInfo(i, 0) & " ::: " & combinedInfo(i, 1)
Next i

End Sub

combinedInfo数组的链接文本为(i,0)索引,链接描述为(i,1)索引。

它有点笨重,但如果它有效,你可以重构为单独的函数(一个用于a标签,一个用于div标签),但它对我有用你给了。