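I have a file that I want to extract dates from. It is an HTML source file, so it is full of code and phrases I don't need. I need to extract every instance of a date that is contained in a particular HTML tag:
abbr title="((this is the text I need))" data-utime="
What is the easiest way to accomplish this?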
Answer 0 (score: 6)
If you're using Excel VBA, set a reference (Tools - References) to the MSHTML library (it is listed as Microsoft HTML Object Library in the References menu).
Sub ScrapeDateAbbr()
    Dim hDoc As MSHTML.HTMLDocument
    'IHTMLElement is implemented by every element and exposes getAttribute
    Dim hElem As MSHTML.IHTMLElement
    Dim sFile As String, lFile As Long
    Dim sHtml As String
    'read in the file
    lFile = FreeFile
    sFile = "C:/Users/dick/Documents/My Dropbox/Excel/Testabbr.html"
    Open sFile For Input As lFile
    sHtml = Input$(LOF(lFile), lFile)
    'release the file handle once the text is in memory
    Close lFile
    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = sHtml
    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        '(a missing attribute yields Null, which fails this test)
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem
End Sub
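I assumed the file was local, since you called it a source file. If you need to download it first, you'll need an additional reference to MSXML, and this code: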
Sub ScrapeDateAbbrDownload()
    Dim xHttp As MSXML2.XMLHTTP
    Dim hDoc As MSHTML.HTMLDocument
    'IHTMLElement is implemented by every element and exposes getAttribute
    Dim hElem As MSHTML.IHTMLElement
    Set xHttp = New MSXML2.XMLHTTP
    'replace this with the URL of the page you need to download
    xHttp.Open "GET", "file:///C:/Users/dick/Documents/My%20Dropbox/Excel/Testabbr.html"
    xHttp.send
    'the request is asynchronous, so wait until it has completed
    Do
        DoEvents
    Loop Until xHttp.readyState = 4
    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = xHttp.responseText
    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem
End Sub
Answer 1 (score: 0)
If you're using Java, you can use Jsoup. Your question is unclear, though; please elaborate on what exactly you are trying to do.
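For the Java route, here is a rough, dependency-free sketch that uses only java.util.regex. Note that this is not Jsoup: it is a plain regular expression, and it assumes the attributes look the way they do in the question (a double-quoted title that comes before data-utime inside the same abbr tag). A real HTML parser such as Jsoup will cope better with markup that deviates from that. The class name, method name, and sample HTML are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AbbrTitleExtractor {

    // Matches an opening <abbr ...> tag that has a double-quoted title attribute
    // followed later in the same tag by a data-utime attribute.
    // Group 1 captures the title value.
    // Assumption: title precedes data-utime, as in the question's snippet.
    private static final Pattern ABBR_TITLE = Pattern.compile(
            "<abbr\\b[^>]*?\\btitle\\s*=\\s*\"([^\"]*)\"[^>]*?\\bdata-utime\\s*=",
            Pattern.CASE_INSENSITIVE);

    // Returns the title of every abbr tag that also carries data-utime,
    // in document order.
    public static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = ABBR_TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up sample: the second abbr has no data-utime and is skipped.
        String html =
                "<p><abbr title=\"Monday, 3 January 2011\" data-utime=\"1294012800\">Mon</abbr>"
              + "<abbr title=\"no timestamp here\">x</abbr></p>";
        for (String title : extractTitles(html)) {
            System.out.println(title);
        }
    }
}
```

To run it against a real file, read the file into a String first (for example with java.nio.file.Files) and pass it to extractTitles. If the attribute order or quoting in your file differs from the question's snippet, the pattern will miss those tags, which is the main reason to prefer a parser.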