Question

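I'm having trouble extracting words in MS Excel. I have several HTML-formatted sentences in a row and want to extract all the words delimited by <b>....</b>.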

Example:

<b>buat</b> <i>v</i> <b>1</b> kerjakan; lakukan; <b>2</b> bikin;<br>--<b> cendol</b>

I want to extract the following words: "buat", "1", "2", "cendol"

Could you help me solve this? Any Excel / VBA code would be appreciated.

Answer 1

There is a very simple way to do this using the HTMLDocument object:

In your VB Editor, go to Tools > References and select Microsoft HTML Object Library.

Then you can use the following code:

Sub extract()

Dim doc As New HTMLDocument 'Declare and create an object of type HTMLDocument
Dim item As HTMLObjectElement 'Declare an object of type HTMLObjectElement. We will use this to loop through a collection of HTML elements

doc.body.innerHTML = "<b>buat</b> <i>v</i> <b>1</b> kerjakan; lakukan; <b>2</b> bikin;<br>--<b> cendol</b> " 'Assign your HTML code as a string in doc body

For Each item In doc.getElementsByTagName("b") 'Loop through all the <b></b> elements in doc
    Debug.Print item.innerText 'print the text contained in <b></b> element. This will show up in your immediate window
Next item

End Sub
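
If you would rather not set the reference, the same idea can be sketched with late binding. This is only a minimal sketch: the procedure name ExtractLateBound is made up for illustration, and it assumes that CreateObject("htmlfile") is available on your machine (Windows only).

```vba
Sub ExtractLateBound()
    ' Late-bound variant: no reference to Microsoft HTML Object Library required
    Dim doc As Object       ' the HTML document
    Dim tags As Object      ' collection of <b> elements
    Dim i As Long

    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = "<b>buat</b> <i>v</i> <b>1</b> kerjakan; lakukan; <b>2</b> bikin;<br>--<b> cendol</b> "

    Set tags = doc.getElementsByTagName("b")
    For i = 0 To tags.Length - 1
        Debug.Print Trim$(tags.Item(i).innerText)   ' Trim$ drops the blank in "<b> cendol</b>"
    Next i
End Sub
```

The trade-off is that you lose IntelliSense and compile-time type checking, but the workbook also runs on machines where the reference is not ticked.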

Answer 2

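If you have Excel 2013+, you can do this with the worksheet function FILTERXML.

First you need to turn the string into "well-formed" XML by enclosing it in an outer tag and closing the unmatched <br> tag: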

"<t>" & $A$1 & "</br></t>"

Then it is just a matter of using XPath to return all the desired tags:

FILTERXML("<t>" & $A$1 & "</br></t>","//b")

Wrapping it in the INDEX function lets you extract the substrings one at a time:

Full formula: enter it in A3 and fill down

=IFERROR(INDEX(FILTERXML("<t>" & $A$1 & "</br></t>","//b"),ROWS($1:1)),"")
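
If you need the same result from a macro, the worksheet function can also be called from VBA. The following is a rough sketch, not a tested solution: the procedure name is made up, the HTML is assumed to be in A1 of the active sheet, and as far as I know WorksheetFunction.FilterXML returns an array for several matches and a single value for one match, so both cases are handled.

```vba
Sub ExtractViaFilterXml()
    Dim xml As String
    Dim result As Variant, v As Variant

    ' Same wrapping as in the formula: outer <t> tag plus a closing </br>
    xml = "<t>" & Range("A1").Value & "</br></t>"

    On Error Resume Next            ' FilterXML raises an error on invalid XML or when nothing matches
    result = Application.WorksheetFunction.FilterXML(xml, "//b")
    If Err.Number <> 0 Then
        Debug.Print "No <b> tokens found, or the string is not well-formed XML"
        Exit Sub
    End If
    On Error GoTo 0

    If IsArray(result) Then         ' assumption: several matches come back as an array
        For Each v In result
            Debug.Print Trim$(v)
        Next v
    Else                            ' assumption: a single match comes back as one value
        Debug.Print Trim$(result)
    End If
End Sub
```

Note that appending "</br>" only repairs a string with exactly one unclosed <br>, as in the example; other input needs a different clean-up step.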

Answer 3

Try this:

Sub Test()
Dim objReg      As Object
Dim objMatches  As Object
Dim match       As Object
Dim s           As String
Dim i           As Integer

s = "<b>buat</b> <i>v</i> <b>1</b> kerjakan; lakukan; <b>2</b> bikin;<br>--<b> cendol</b> "
Set objReg = CreateObject("VBScript.RegExp")

With objReg
    .IgnoreCase = False
    .Global = True
    .Pattern = "<b>(.*?)<\/b>"
    Set objMatches = .Execute(s)
End With

For Each match In objMatches
    For i = 0 To match.Submatches.Count - 1
        Debug.Print Trim(match.Submatches.item(i))
    Next i
Next match

Set objReg = Nothing
End Sub
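
The pattern above only matches a plain lowercase <b> tag. If your data may contain uppercase tags or attributes, a slightly more tolerant pattern could look like the sketch below. BoldTokens and TestBoldTokens are hypothetical names, not part of the answer above, and the test string is made up.

```vba
Function BoldTokens(ByVal html As String) As Collection
    ' Hypothetical helper: returns the trimmed text of every <b>...</b> pair in html
    Dim re As Object, m As Object
    Dim result As New Collection

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True                        ' also match <B>...</B>
    re.Pattern = "<b\b[^>]*>([\s\S]*?)</b>"     ' \b keeps <br> out; [^>]* allows attributes

    For Each m In re.Execute(html)
        result.Add Trim$(m.SubMatches(0))
    Next m

    Set BoldTokens = result
End Function

Sub TestBoldTokens()
    Dim token As Variant
    ' Prints "buat" and "1" in the Immediate window
    For Each token In BoldTokens("<B class=""x"">buat</B> <i>v</i> <b>1</b>")
        Debug.Print token
    Next token
End Sub
```

A regular expression still cannot cope with <b> tags nested inside other <b> tags; for that kind of input one of the DOM-based answers is the safer choice.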

Answer 4

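An alternative approach using an XML DOMDocument

To parse an HTML string you can use a document object structure, as in an HTMLDocument or in XML. So for the sake of completeness I show an alternative approach, in addition to @StavrosJon's valid solution (which uses the more lenient HTMLDocument, which does not need to be well-formed the way XML does):

Example call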

Sub ExtractViaXML()
  Dim html$, myArray()
  html = "<b>buat</b> <i>v</i> <b>1</b> kerjakan; lakukan; <b>2</b> bikin;<br>--<b> cendol</b> "
  myArray = getTokens(html, "b")                              ' assign findings to array via function getTokens()
  Debug.Print UBound(myArray) + 1 & " token(s) found: " & Join(myArray, ", ")  ' display results
End Sub

Main function getTokens()

Function getTokens(ByVal html$, Optional myTag$ = "b") As Variant()
' Purpose: isolate "<b>"-Tags (default value) out of html string and return found tokens as variant array
' Note:    creates temporary XML DOMDocument (late bound MSXML2 reference)
  Dim XmlString$
  XmlString = wellformed("<?xml version=""1.0"" encoding=""utf-8""?><tokens>" & html & "</tokens>")

  With CreateObject("MSXML2.DOMDocument.6.0")
      .ValidateOnParse = True: .Async = False
      If .LoadXML(XmlString) Then                              ' load xml string
          Dim myNodeList As Object
          Set myNodeList = .DocumentElement.SelectNodes(myTag) ' set node list to memory
          Dim i&, ii&, arr()
          ii = myNodeList.Length - 1                           ' calculate upper boundary of zero-based array
          If ii > -1 Then ReDim arr(ii)                        ' (re)dimension variant array arr()
          For i = 0 To ii                                      ' loop through node list
              arr(i) = myNodeList.item(i).Text                 ' assign each found text content to array
          Next i
          If ii = -1 Then arr = Array("**Nothing found**")     ' provide for zero findings
          getTokens = arr                                      ' return 0-based 1-dim array with found tokens
      Else: ShowParseError .ParseError                         ' optional: display possible error message (no parentheses, so the object itself is passed)
      End If
  End With
End Function
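
One thing to be aware of: SelectNodes(myTag) on the document element only returns direct children of <tokens>. That is enough for the example string, but if your <b> tags can sit inside other tags, an XPath such as "//b" finds them at any depth. A small sketch to illustrate the difference (the procedure name and the test string are made up):

```vba
Sub NestedTagsViaXPath()
    Dim xml As Object, nodes As Object

    Set xml = CreateObject("MSXML2.DOMDocument.6.0")
    xml.async = False

    If xml.LoadXML("<tokens><i><b>nested</b></i> <b>top</b></tokens>") Then
        Set nodes = xml.DocumentElement.SelectNodes("b")   ' direct children only
        Debug.Print nodes.Length                           ' prints 1
        Set nodes = xml.SelectNodes("//b")                 ' any depth
        Debug.Print nodes.Length                           ' prints 2
    End If
End Sub
```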

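Helper functions

XML requires a well-formed node structure with opening and closing tags, whereas HTML tolerates, for example, a single line break (<br>). I therefore added a simple function wellformed() to cure this, since an unclosed tag would otherwise prevent a successful load. I also demonstrate the optional procedure ShowParseError for locating (other) possible load errors; you can use it as a supplement to any .load or .loadXML call.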

Function wellformed$(ByVal s$)
' Purpose: force a wellformed version of line breaks in html/xml string ("<br/>")
' Note:    unclosed tags like <br> only would prevent a successful load of the xml document
  wellformed = Replace(Replace(s, "</br>", "<br>"), "<br>", "<br/>")
End Function

Sub ShowParseError(pe As Object)
' Purpose: display possible parse error
' Note:    localizes error occurrence also by indicating position
        Dim ErrText$
        With pe
           ErrText = "Load error " & .ErrorCode & " xml file " & vbCrLf & _
           Replace(.URL, "file:///", "") & vbCrLf & vbCrLf & _
          .reason & _
          "Source Text: " & .srcText & vbCrLf & vbCrLf & _
          "Line No.:    " & .Line & vbCrLf & _
          "Line Pos.: " & .linepos & vbCrLf & _
          "File Pos.:  " & .filepos & vbCrLf & vbCrLf
        End With
        MsgBox ErrText, vbExclamation
End Sub

Answer 5

I tried to simulate this in Excel. Please see my sample solution below.


Answer 6

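I tried something different, involving Split, Join and Split again, and then looping through the array. I typed the text <b>buat</b> <i>v</i> <b>1</b> kerjakan; lakukan; <b>2</b> bikin;<br>--<b> cendol</b> into cell A1: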

Sub Macro1()
Dim MyWords As Variant
Dim i As Long
Dim MyDelimiter As String
Dim MyLen As Byte

MyDelimiter = "||" 'Choose a delimiter that does not occur in the text
MyLen = Len(MyDelimiter)

MyWords = Split(Join(Split(Range("A1").Value, "<b>"), MyDelimiter), "</b>")

For i = 0 To UBound(MyWords) Step 1
    Debug.Print Mid(MyWords(i), InStr(1, MyWords(i), MyDelimiter) + MyLen, 99) 'Increase 99 if you are sure there will be longer texts between tags <b>..</b>
Next i

Erase MyWords
End Sub
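
A variation on the same idea that needs neither the helper delimiter nor the 99-character limit: split on the opening tag only, then cut each piece at its closing tag. This is just a sketch under the same assumption that the HTML is in cell A1; the name SplitTokens is made up.

```vba
Sub SplitTokens()
    Dim parts() As String
    Dim i As Long, p As Long

    parts = Split(Range("A1").Value, "<b>")     ' parts(0) is the text before the first <b>

    For i = 1 To UBound(parts)                  ' every later piece starts right after a <b>
        p = InStr(1, parts(i), "</b>")
        If p > 0 Then Debug.Print Trim$(Left$(parts(i), p - 1))
    Next i
End Sub
```

Pieces without a closing tag are simply skipped, so text after the last </b> is never printed by mistake.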

I get the four words buat, 1, 2 and cendol in the Immediate window.
