Web scraping an HTML page that has no tags usable as delimiters

Time: 2017-03-27 13:20:06

Tags: html excel vba dom

I am trying to import every line of text from a web page into a string array. The page is: Vaticano - La Sacra Bibbia - Genesi - Cap. 1

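Unfortunately (probably a choice made by the web designer), none of the tags has an ID or a CLASS. The lines are separated by one or more <BR> elements, and the start and end of the text are set apart from a simple menu by two <HR> tags. A short excerpt of the page source is here: jsfiddle. I have found a way to get at the text. This is what I have in VBA so far: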

Note: objDoc is a Public variable declared in another module; it is filled from .responseText without any problem.

Public Sub ScriviXHTML(strBook As String, intNumCap As Integer)
Dim objDoc2 As Object   ' late bound; holds the document handed over from objDoc
Dim strDati2 As String
Dim TagBr As IHTMLElementCollection
Dim BrElement As IHTMLElement
Dim intElement As Integer
Dim objChild As Object
' Each variable needs its own "As String"; otherwise all but the last are Variants.
Dim strData As String, strTextCont As String, strNodeVal As String, strWholeText As String

Set objDoc2 = objDoc
Set objDoc = Nothing

'Put the HTML code of the web page in a string variable.
strDati2 = objDoc2.body.innerHTML
'Collect all the BR tags in the object variable.
Set TagBr = objDoc2.body.getElementsByTagName("BR")

'Loop over all the BRs in the page.
For Each BrElement In TagBr
    'Here I try to get the NextSibling of the <br>
    ' because it seems to contain the text I'm looking for.
    Set objChild = BrElement.NextSibling
    With objChild
        ' Here I try to read its members into the variables.
        ' Run-time error '438' is raised when the sibling has no .Data member.
        strData = Trim("" & .Data & "")
        strTextCont = Trim("" & .textContent & "")
        strNodeVal = Trim("" & .NodeValue & "")
        strWholeText = Trim("" & .wholeText & "")
    End With
    intElement = intElement + 1

Next BrElement
End Sub

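Two questions:
1) In your opinion, is this the best way to achieve what I want to do?
2) Sometimes Element.NextSibling.Data does not exist and run-time error '438' is raised, so I manually move the routine's execution point past the failing line to get around the error. How can I intercept this error? [Please, not a plain On Error Resume Next!] Better still: how can I use an If ... Then ... End If statement to check whether the Data member exists on NextSibling?
Thanks.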

1 Answer:

Answer 0: (score: 0)

You can get all of the text as follows:

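' Requires a reference to Microsoft HTML Object Library (for HTMLDocument).
Public Sub GetInfo()
    Dim sResponse As String, xhr As Object, html As New HTMLDocument
    Dim lStart As Long
    Set xhr = CreateObject("MSXML2.XMLHTTP")
    With xhr
        .Open "GET", "http://www.vatican.va/archive/ITA0001/__P1.HTM", False
        .send
        ' Convert the raw response bytes to a VBA string.
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
    ' Drop anything before the DOCTYPE; leave the text alone if it is not found (InStr returns 0).
    lStart = InStr(1, sResponse, "<!DOCTYPE ")
    If lStart > 0 Then sResponse = Mid$(sResponse, lStart)
    html.body.innerHTML = sResponse
    ' Strip every tag, then turn " &nbsp;" and line feeds into spaces; the result is written to cell A1.
    [A1] = Replace$(Replace$(regexRemove(html.body.innerHTML, "<([^>]+)>"), " &nbsp;", Chr$(32)), Chr$(10), Chr$(32))
End Sub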

Public Function regexRemove(ByVal s As String, ByVal pattern As String) As String
    Dim regex As Object
    Set regex = CreateObject("VBScript.RegExp")
    With regex
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .pattern = pattern
    End With

    If regex.test(s) Then
        regexRemove = regex.Replace(s, vbNullString)
    Else
        regexRemove = s
    End If
End Function
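
On the second question (checking with If ... Then instead of On Error Resume Next), one option is to test the node's nodeType before reading members that only text nodes have. The following is a minimal sketch, not a tested solution for this page: it assumes the text nodes are direct children of body and lie between the two <HR> elements, as the question describes. CollectTextLines is a hypothetical helper name, and the document is passed late bound so that both text nodes and elements can be walked with the same variable.

```vba
' Hypothetical helper: returns the text lines found between the first and
' second <hr> among the direct children of <body>.
' Assumption: the text is not nested inside other elements.
Public Function CollectTextLines(ByVal doc As Object) As Collection
    Const NODE_TEXT As Long = 3          ' DOM nodeType value of a text node
    Dim lines As New Collection
    Dim node As Object                   ' late bound: text node or element
    Dim hrCount As Long
    Dim txt As String

    Set node = doc.body.firstChild
    Do While Not node Is Nothing
        If node.nodeType = NODE_TEXT Then
            ' Only text nodes are read here, so <br> and other elements
            ' never reach the text-only members that raise error 438.
            If hrCount = 1 Then
                txt = CStr(node.nodeValue)
                ' Normalise non-breaking spaces and line breaks before trimming.
                txt = Replace(txt, ChrW$(160), " ")
                txt = Replace(Replace(txt, vbCr, " "), vbLf, " ")
                txt = Trim$(txt)
                If Len(txt) > 0 Then lines.Add txt
            End If
        ElseIf UCase$(node.nodeName) = "HR" Then
            hrCount = hrCount + 1
            If hrCount >= 2 Then Exit Do  ' past the closing <hr>: stop
        End If
        Set node = node.nextSibling
    Loop

    Set CollectTextLines = lines
End Function
```

Called as Set colLines = CollectTextLines(objDoc2), it returns a Collection that can be looped with For Each or copied into a string array sized with colLines.Count. If the page wraps the text in another element, the walk would have to start from that element's firstChild instead of body's.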