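I want to use a regular expression to extract data from an HTML file, but I don't know which pattern to use. The HTML comes from an email.
Below is part of the HTML. I'd like to be able to get "40120 LBS".
What would the pattern look like?
I was thinking of something like: Shipment's weight[any characters][0-9][0-9][0-9][0-9][0-9]
...and so on.
Maybe you know a more efficient way to get what I want. Thanks.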
<tr style='mso-yfti-irow:8' id="row_65">
<td width=170 valign=top style='width:127.5pt;background:white;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
weight<o:p></o:p></span></p>
</td>
<td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
</td>
</tr>
<tr style='mso-yfti-irow:9' id="row_116">
<td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
or LBS<o:p></o:p></span></p>
</td>
<td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
</td>
</tr>
Answer 0 (score: 1)
Admittedly, this parsing routine doesn't do exactly what you need, but it should point you in the right direction in VBA.
'Requires references to Microsoft Internet Controls and Microsoft HTML Object Library
Sub Extract_TD_text()
    Dim URL As String
    Dim IE As InternetExplorer
    Dim HTMLdoc As HTMLDocument
    Dim TDelements As IHTMLElementCollection
    Dim TDelement As HTMLTableCell
    Dim r As Long

    'Saved from www vbaexpress com/forum/forumdisplay.php?f=17
    URL = "file://C:\VBAExpress_Excel_Forum.html"

    Set IE = New InternetExplorer
    With IE
        .navigate URL
        .Visible = True
        'Wait for page to load
        While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend
        Set HTMLdoc = .document
    End With

    Set TDelements = HTMLdoc.getElementsByTagName("TD")
    Sheet1.Cells.ClearContents
    r = 0
    For Each TDelement In TDelements
        'Look for required TD elements - this check is specific to VBA Express forum - modify as required
        If TDelement.className = "alt2" And TDelement.Align = "center" Then
            Sheet1.Range("A1").Offset(r, 0).Value = TDelement.innerText
            r = r + 1
        End If
    Next
End Sub
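The check inside the loop above is specific to that forum page. For the HTML in the question, one possible replacement rule is to pick cells by their `id` attribute. Here is a minimal sketch of that idea in Python (not VBA), using only the standard `html.parser` module. It assumes the ids `value_65` and `value_116` from the sample stay the same from one email to the next, which the question does not confirm. `SAMPLE_HTML` is an abridged copy of the sample, and `CellTextById` is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Abridged copy of the sample from the question (style attributes shortened).
SAMPLE_HTML = """<tr id="row_65">
<td width=170 id="question_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
weight<o:p></o:p></span></p>
</td>
<td id="value_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
</td>
</tr>
<tr id="row_116">
<td width=170 id="question_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
or LBS<o:p></o:p></span></p>
</td>
<td id="value_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
</td>
</tr>
"""


class CellTextById(HTMLParser):
    """Collect the text of every <td> whose id is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {}
        self._current = None  # id of the wanted <td> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            cell_id = dict(attrs).get("id")
            if cell_id in self.wanted:
                self._current = cell_id
                self.found[cell_id] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.found[self._current] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._current = None


# Assumption: these two ids identify the weight and unit cells.
parser = CellTextById(["value_65", "value_116"])
parser.feed(SAMPLE_HTML)
parser.close()

cells = {cell_id: text.strip() for cell_id, text in parser.found.items()}
result = f"{cells['value_65']} {cells['value_116']}"
print(result)
```

If the ids are generated per message rather than fixed, matching on the label text in the neighbouring cell is probably the safer rule.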
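Parsing HTML with regular expressions is not recommended because of all the obscure edge cases that can crop up, but it seems you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.
This regular expression does the following: for each <tr> whose id has the form row_N, it captures the quote character around the id, the row number, and the text of the row's first and second cells.
Regular expression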
<tr\s
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)row_([0-9]+)\1(?:\s|>))
(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
[^<]*</tr>
Note: this regular expression needs the following flags: ignore whitespace, case-insensitive, and dot matches all characters.
Given your sample text
<tr style='mso-yfti-irow:8' id="row_65">
<td width=170 valign=top style='width:127.5pt;background:white;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
weight<o:p></o:p></span></p>
</td>
<td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
</td>
</tr>
<tr style='mso-yfti-irow:9' id="row_116">
<td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
or LBS<o:p></o:p></span></p>
</td>
<td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
</td>
</tr>
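The regular expression produces the following matches, each with its capture groups (group 0 is the whole match, 1 the quote character around the id, 2 the row number, 3 the text of the first cell, 4 the text of the second cell):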
[0][0] = <tr style='mso-yfti-irow:8' id="row_65">
<td width=170 valign=top style='width:127.5pt;background:white;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
weight<o:p></o:p></span></p>
</td>
<td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
</td>
</tr>
[0][1] = "
[0][2] = 65
[0][3] = Shipment's
weight
[0][4] = 40120
[1][0] = <tr style='mso-yfti-irow:9' id="row_116">
<td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
or LBS<o:p></o:p></span></p>
</td>
<td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
</td>
</tr>
[1][1] = "
[1][2] = 116
[1][3] = KG
or LBS
[1][4] = LBS
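The listing above can be reproduced with any engine that supports all three flags. As far as I know, the VBScript RegExp object usually used from VBA has no free-spacing or dot-matches-all option, so here is a minimal sketch in Python instead, where the flags map to `re.VERBOSE`, `re.IGNORECASE` and `re.DOTALL`. The pattern is the one given above, unchanged. The names `ROW_PATTERN`, `SAMPLE_HTML` and `by_label` are mine, and joining the two captured values into "40120 LBS" is an assumption about how the result will be used.

```python
import re

# The sample HTML from the question, verbatim.
SAMPLE_HTML = """<tr style='mso-yfti-irow:8' id="row_65">
<td width=170 valign=top style='width:127.5pt;background:white;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
weight<o:p></o:p></span></p>
</td>
<td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
</td>
</tr>
<tr style='mso-yfti-irow:9' id="row_116">
<td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
or LBS<o:p></o:p></span></p>
</td>
<td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
</td>
</tr>
"""

# The pattern from the answer, compiled with the three required flags.
ROW_PATTERN = re.compile(
    r"""
<tr\s
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)row_([0-9]+)\1(?:\s|>))
(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
[^<]*</tr>
""",
    re.VERBOSE | re.IGNORECASE | re.DOTALL,
)

# Map each row's label (group 3) to its value (group 4).
# Labels wrap across lines in the source, so collapse their whitespace.
by_label = {}
row_ids = []
for match in ROW_PATTERN.finditer(SAMPLE_HTML):
    row_ids.append(match.group(2))
    label = " ".join(match.group(3).split())
    by_label[label] = match.group(4).strip()

weight = by_label["Shipment's weight"]
unit = by_label["KG or LBS"]
result = f"{weight} {unit}"
print(result)
```

Keying the results by label rather than by row number means the lookup does not depend on the ids 65 and 116 staying fixed.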
NODE EXPLANATION
----------------------------------------------------------------------
<tr '<tr'
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
id= 'id='
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
row_ 'row_'
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[0-9]+ any character of: '0' to '9' (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
< '<'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
td 'td'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
p 'p'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
span 'span'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
)+ end of grouping
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
----------------------------------------------------------------------
</td> '</td>'
----------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
< '<'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
td 'td'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
p 'p'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
span 'span'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
)+ end of grouping
----------------------------------------------------------------------
( group and capture to \4:
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \4
----------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
----------------------------------------------------------------------
</td> '</td>'
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
</tr> '</tr>'
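For comparison, the pattern the question sketches (the label, then any characters, then the digits) can be written much more compactly if the markup is as regular as the sample. This is a minimal sketch in Python, not a replacement for the pattern above. It assumes the value always sits directly inside the first `<span>` that follows the label, and it will break if that cell gains nested markup or the label wording changes. `value_after` is a hypothetical helper, and `SAMPLE_HTML` is an abridged copy of the sample.

```python
import re

# Abridged copy of the sample from the question (style attributes shortened).
SAMPLE_HTML = """<tr id="row_65">
<td width=170 id="question_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
weight<o:p></o:p></span></p>
</td>
<td id="value_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
</td>
</tr>
<tr id="row_116">
<td width=170 id="question_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
or LBS<o:p></o:p></span></p>
</td>
<td id="value_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
</td>
</tr>
"""


def value_after(label_pattern, value_pattern, html):
    """Return the first value found in a <span> after the label, or None."""
    # label, then as little as possible, then the next <span ...>, then the value
    pattern = label_pattern + r".*?<span[^>]*>(" + value_pattern + r")<"
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else None


# \s+ is used between words because the label text wraps across lines.
weight = value_after(r"Shipment's\s+weight", r"[0-9]+", SAMPLE_HTML)
unit = value_after(r"KG\s+or\s+LBS", r"[A-Za-z]+", SAMPLE_HTML)
result = f"{weight} {unit}"
print(result)
```

Using `[0-9]+` instead of five fixed digit classes keeps the pattern working when the weight has more or fewer digits.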
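Answer 1 (score: 1)
Rather than using RegExp to parse the HTML file, use a DOM parser.
The most straightforward way is to add a reference to the Microsoft HTML Object Library and use that. Getting to know the objects can be a bit tricky, but not as tricky as trying to handle HTML with regular expressions!
The key is to decide which rule you will use to extract the value.
Here is an example that (hopefully) demonstrates the technique.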
Public Sub SimpleParser()
    Dim doc As MSHTML.HTMLDocument
    Dim b As MSHTML.HTMLBody
    Dim tr As MSHTML.HTMLTableRow, td As MSHTML.HTMLTableCell
    Dim columnNumber As Long, rowNumber As Long
    Dim trCells As MSHTML.IHTMLElementCollection

    Set doc = New MSHTML.HTMLDocument
    doc.body.innerHTML = "<table><tr style='mso-yfti-irow:8' id=""row_65""> <td width=170 valign=top style='width:127.5pt;background:white; padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""question_65""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>Shipment's weight<o:p></o:p></span></p> </td> <td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""value_65""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>40120<o:p></o:p></span></p> </td> </tr> <tr style='mso-yfti-irow:9' id=""row_116""> <td width=170 valign=top style='width:127.5pt;background:#F3F3F3; padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""question_116""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>KG or LBS<o:p></o:p></span></p> </td> <td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""value_116""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>LBS<o:p></o:p></span></p> </td> </tr></table>"
    Set b = doc.body

    'Example of looping through elements
    For Each tr In b.getElementsByTagName("tr")
        rowNumber = rowNumber + 1
        columnNumber = 0
        For Each td In tr.getElementsByTagName("td")
            columnNumber = columnNumber + 1
            Debug.Print rowNumber & "," & columnNumber, td.innerText
        Next
    Next

    'Go through each row; if the first cell is "Shipment's weight", display the next cell.
    For Each tr In b.getElementsByTagName("tr")
        Set trCells = tr.getElementsByTagName("td")
        If trCells.Item(0).innerText = "Shipment's weight" Then Debug.Print "Weight: " & trCells.Item(1).innerText
    Next
End Sub
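The rule in the second loop (find the row whose first cell is the label, then read the next cell) does not depend on MSHTML. Here is a rough sketch of the same rule in Python with the standard `html.parser` module, in case the extraction is done outside VBA. `TableRows` and `value_for` are hypothetical names, and `SAMPLE_HTML` is an abridged copy of the sample. Unlike the VBA comparison, this version collapses whitespace inside each cell before comparing, because the label wraps across two lines in the original email.

```python
from html.parser import HTMLParser

# Abridged copy of the sample from the question (style attributes shortened).
SAMPLE_HTML = """<table>
<tr id="row_65">
<td width=170 id="question_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
weight<o:p></o:p></span></p>
</td>
<td id="value_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
</td>
</tr>
<tr id="row_116">
<td width=170 id="question_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
or LBS<o:p></o:p></span></p>
</td>
<td id="value_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
</td>
</tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell texts
        self._row = None   # cells of the row being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def value_for(rows, label):
    """Return the second cell of the first row whose first cell equals label."""
    for row in rows:
        if len(row) >= 2 and row[0] == label:
            return row[1]
    return None


parser = TableRows()
parser.feed(SAMPLE_HTML)
parser.close()

weight = value_for(parser.rows, "Shipment's weight")
unit = value_for(parser.rows, "KG or LBS")
result = f"{weight} {unit}"
print(result)
```

Looking up the unit row the same way gives both halves of the "40120 LBS" string the question asks for.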