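I have HTML strings retrieved from the Discourse API that contain some elements (p, span, div, etc.), some of which have attributes such as data-time, data-timezone, and data-email-preview. I would like to get the values of the data-email-preview attribute; those values are timestamps. They are always found in the first two span elements of the HTML string. Example HTML string: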
<p><span data-date="2019-05-10" data-time="19:00:00" class="discourse-local-date" data-timezones="Europe/Brussels" data-timezone="Europe/Berlin" data-email-preview="2019-05-10T17:00:00Z UTC">2019-05-10T17:00:00Z</span> → <span data-date="2019-05-10" data-time="22:00:00" class="discourse-local-date" data-timezones="Europe/Brussels" data-timezone="Europe/Berlin" data-email-preview="2019-05-10T20:00:00Z UTC">2019-05-10T20:00:00Z</span><br>
<div class="lightbox-wrapper"><div class="meta">
<span class="filename">HackSpace_by_Sugar_Ray_Banister.jpg</span><span class="informations">1596×771 993 KB</span><span class="expand"></span>
</div></a></div></p>
I need to extract the two dates between the span tags:
2019-05-10T17:00:00Z
and 2019-05-10T20:00:00Z
Answer 0: (score: 1)
(?<=>)(\d{4}\-\d{2}\-\d{2}T\d{2}\:\d{2}\:\d{2}Z)(?=<\/span>)
This will return the elements you need.
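As a minimal sketch of how that pattern behaves, here it is applied with Python's `re` module (the answer does not say which regex engine it targets, so Python is an assumption; the helper name `extract_dates` is made up for illustration). The sample string is the first line of the HTML from the question. The lookbehind requires a `>` directly before the timestamp and the lookahead requires `</span>` directly after it, so the copies inside the attribute values are not matched.

```python
import re

# First line of the sample HTML from the question.
html = (
    '<p><span data-date="2019-05-10" data-time="19:00:00" '
    'class="discourse-local-date" data-timezones="Europe/Brussels" '
    'data-timezone="Europe/Berlin" '
    'data-email-preview="2019-05-10T17:00:00Z UTC">2019-05-10T17:00:00Z</span> → '
    '<span data-date="2019-05-10" data-time="22:00:00" '
    'class="discourse-local-date" data-timezones="Europe/Brussels" '
    'data-timezone="Europe/Berlin" '
    'data-email-preview="2019-05-10T20:00:00Z UTC">2019-05-10T20:00:00Z</span><br>'
)

# Same pattern as in the answer, minus the escapes that are not needed here.
PATTERN = re.compile(r"(?<=>)(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)(?=</span>)")


def extract_dates(text):
    """Return every timestamp that sits directly between '>' and '</span>'."""
    return PATTERN.findall(text)


dates = extract_dates(html)
print(dates)  # ['2019-05-10T17:00:00Z', '2019-05-10T20:00:00Z']
```

If the string could contain other spans with the same timestamp format, slicing the result to the first two entries (`dates[:2]`) keeps only the pair the question asks for.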
Answer 1: (score: 0)
Answer 2: (score: 0)
You can do this with the HTML DOM library available on GitHub, although I downloaded it from SourceForge at this link: https://simplehtmldom.sourceforge.io
Use it as follows:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
For your case, select the span elements like this:
// 'span.data-email-preview' selects by class name; to select by attribute use find('span[data-email-preview]')
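If you would rather use a regex function (preg_match_all is the one that extracts matches; preg_replace only substitutes), it is simple but can get confusing: the string contains many values, so the output will contain many dates. You then have to build an array from that output and loop over it to get each date on its own line, so that you can import the dates into a database.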
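The attribute-based lookup suggested in this answer can be sketched without any third-party library. The example below swaps simplehtmldom for Python's standard-library `html.parser`, so it illustrates the idea rather than the answer's exact method; `PreviewCollector`, `extract_previews`, and the trimmed `sample` string are hypothetical names and data made up for this sketch. It reads `data-email-preview` from each span start tag and drops the trailing ` UTC` suffix seen in the question's sample.

```python
from html.parser import HTMLParser


class PreviewCollector(HTMLParser):
    """Collects data-email-preview values from <span> start tags."""

    def __init__(self):
        super().__init__()
        self.previews = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        value = dict(attrs).get("data-email-preview")
        if value:
            self.previews.append(value)


def extract_previews(html, limit=2):
    """Return the timestamps of the first `limit` spans carrying the attribute."""
    parser = PreviewCollector()
    parser.feed(html)
    # Values look like "2019-05-10T17:00:00Z UTC"; keep only the timestamp part.
    return [value.split()[0] for value in parser.previews[:limit]]


# Trimmed version of the question's sample (most attributes omitted).
sample = (
    '<p><span data-email-preview="2019-05-10T17:00:00Z UTC">2019-05-10T17:00:00Z</span> → '
    '<span data-email-preview="2019-05-10T20:00:00Z UTC">2019-05-10T20:00:00Z</span><br>'
    '<span class="filename">HackSpace_by_Sugar_Ray_Banister.jpg</span></p>'
)

print(extract_previews(sample))
```

Reading the attribute instead of the element text means spans without `data-email-preview`, such as the filename span, are skipped automatically.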
Answer 3: (score: 0)
In VBA, something like this:
Sub Extract2()
    'Requires references to "Microsoft HTML Object Library" and
    '"Microsoft VBScript Regular Expressions 5.5"
    Dim hDoc As MSHTML.HTMLDocument
    Dim dateBody As Object
    Dim sFile As String, lFile As Long
    Dim pat1 As String
    Dim sHtml As String
    Dim Date1 As String, Date2 As String
    sFile = "c:\1.html"
    'read in the file
    lFile = FreeFile
    Open sFile For Input As lFile
    sHtml = Input$(LOF(lFile), lFile)
    Close lFile
    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = sHtml
    Set dateBody = hDoc.getElementsByClassName("discourse-local-date")
    Date1 = dateBody(0).innerText
    Date2 = dateBody(1).innerText
    MsgBox Date1 & " " & Date2
    'regex: capture the text of each span; no greedy .* that would skip earlier matches
    pat1 = "<span[^>]*>([^<]+)</span>"
    Date1 = simpleRegex(sHtml, pat1, 0)
    Date2 = simpleRegex(sHtml, pat1, 1)
    MsgBox Date1 & " " & Date2
End Sub
The regex helper function:
Function simpleRegex(strInput As String, strPattern As String, sNr As Long) As String
    Dim regEx As New RegExp
    Dim sReg As Object
    'default result when the pattern is empty or match number sNr does not exist
    simpleRegex = "false"
    If strPattern <> "" Then
        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
            .Pattern = strPattern
        End With
        Set sReg = regEx.Execute(strInput)
        'return the first capture group of match number sNr
        If sReg.Count > sNr Then
            simpleRegex = sReg(sNr).SubMatches(0)
        End If
    End If
End Function
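When span contents are matched with a regex, a greedy pattern such as `.*span.*>(.+?)<` can silently skip earlier matches on the same line, because the leading `.*` consumes as much as it can before backtracking. The sketch below compares it with a tighter pattern using Python's `re` module rather than the VBScript engine (an assumption; the two engines treat these constructs the same way as far as I know). The `line` sample is a trimmed version of the question's HTML, and `span_texts` is a hypothetical helper name.

```python
import re

# Both date spans on one line, as in the question's sample (attributes trimmed).
line = (
    '<span class="discourse-local-date">2019-05-10T17:00:00Z</span> → '
    '<span class="discourse-local-date">2019-05-10T20:00:00Z</span><br>'
)

GREEDY = r".*span.*>(.+?)<"           # leading .* swallows the first span
TIGHT = r"<span[^>]*>([^<]+)</span>"  # one match per non-empty span


def span_texts(pattern, text):
    """Return the captured group of every match of `pattern` in `text`."""
    return re.findall(pattern, text)


greedy_result = span_texts(GREEDY, line)
tight_result = span_texts(TIGHT, line)

print(greedy_result)
print(tight_result)
```

With both spans on one line, the greedy pattern yields only the second date, while the tight pattern yields both in document order, so match numbers 0 and 1 correspond to the two timestamps.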