Extracting text from within HTML tags in a file

Time: 2012-03-18 11:59:03

Tags: excel web-scraping extract analysis text-extraction

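I have a file that I want to extract dates from. It is an HTML source file, so it is full of code and phrases I don't need. I need to extract every instance of a date contained in a specific HTML tag:

abbr title="((this is the text I need))" data-utime="

What is the simplest way to achieve this?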

2 answers:

Answer 0 (score: 6)

If you are using Excel VBA, set a reference (Tools - References) to the MSHTML library (listed in the References dialog as Microsoft HTML Object Library).

Sub ScrapeDateAbbr()

    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.IHTMLElement 'interface implemented by every element type
    Dim sFile As String, lFile As Long
    Dim sHtml As String

    'read in the file
    lFile = FreeFile
    sFile = "C:/Users/dick/Documents/My Dropbox/Excel/Testabbr.html"
    Open sFile For Input As lFile
    sHtml = Input$(LOF(lFile), lFile)
    'release the file handle once the text is in memory
    Close lFile

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = sHtml

    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem

End Sub

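I assumed the file is local, since you called it a source file. If you need to download it first, you'll need one more reference, to MSXML, and this code: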

Sub ScrapeDateAbbrDownload()

    Dim xHttp As MSXML2.XMLHTTP
    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.IHTMLElement 'interface implemented by every element type

    Set xHttp = New MSXML2.XMLHTTP
    xHttp.Open "GET", "file:///C:/Users/dick/Documents/My%20Dropbox/Excel/Testabbr.html"
    xHttp.send

    Do
        DoEvents
    Loop Until xHttp.readyState = 4

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = xHttp.responseText

    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem

End Sub

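Answer 1 (score: 0)

If you are using Java, you can use Jsoup. Your question is not clear, though; please describe in more detail exactly what you are doing.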
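
For readers working outside Excel or Java, the selection logic from answer 0 (keep `abbr` tags that carry a non-empty `data-utime` attribute, then read their `title`) can be sketched with Python's standard-library `html.parser`. This is only a minimal sketch, not something from the original answers: the class name `AbbrTitleParser`, the helper `extract_abbr_titles`, and the sample markup are made up for illustration, and it assumes the file's contents have already been read into a string.

```python
from html.parser import HTMLParser


class AbbrTitleParser(HTMLParser):
    """Collect the title of every <abbr> tag that has a data-utime attribute.

    Hypothetical helper class, named here for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this
        if tag != "abbr":
            return
        attr_map = dict(attrs)
        # keep only tags with a non-empty data-utime, as the VBA loop does
        if attr_map.get("data-utime"):
            # a missing title is recorded as an empty string
            self.titles.append(attr_map.get("title") or "")


def extract_abbr_titles(html_text):
    """Return the title attributes of all <abbr data-utime=...> tags, in order."""
    parser = AbbrTitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.titles


if __name__ == "__main__":
    # made-up sample markup, not taken from the asker's file
    sample = (
        '<div><abbr title="first date" data-utime="1">a while ago</abbr>'
        '<abbr title="no timestamp attribute">ignored</abbr></div>'
    )
    for title in extract_abbr_titles(sample):
        print(title)
```

To run it against a real file, read the file into a string first (for example with `open(path, encoding="utf-8").read()`, where the encoding is an assumption about your file) and pass that string to `extract_abbr_titles`.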