我有约。 excel中包含RTF的12000个单元格(包括格式化标签)。我需要解析它们才能找到未格式化的文本。
这是其中一个带文字的单元格的示例:
{\rtf1\ansi\deflang1060\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset238
Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\fs24\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}}
\paperw11908\paperh16833\margl1800\margr1800\margt1440\margb1440\headery720\footery720
\deftab720\formshade\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd\pgwsxn11908\pghsxn16833\marglsxn1800\margrsxn1800\margtsxn1440\margbsxn1440
\headery720\footery720\sbkpage\pgncont\pgndec
\plain\plain\f1\fs24\pard TPR 0160 000\par IPR 0160 000\par OB-R-02-28\par}
我真正需要的就是:
TPR 0160 000
IPR 0160 000
OB-R-02-28
简单地循环遍历单元格并删除不必要的格式化的问题是,并非这些12000单元格中的所有内容都像这样简单。所以我需要手动检查许多不同的版本并编写几个变体;并且最后还会有很多手工工作要做。
但是,如果我将一个单元格的内容复制到空文本文档并将其保存为RTF,然后使用MS Word打开它,它会立即解析文本并得到我想要的内容。不幸的是,对12000个电池来说非常不方便。
所以我在考虑VBA宏,将单元格内容移动到Word,强制解析然后将结果复制回原始单元格。不幸的是,我不确定该怎么做。
有人有任何想法吗?还是一种不同的方法?我将非常感谢解决方案或推动正确的方向。
TNX!
答案 0 :(得分:7)
如果您确实想要沿着使用Word解析文本的路线,此功能应该可以帮助您解决问题。正如评论所示,您需要引用MS Word对象库。
Function ParseRTF(strRTF As String) As String
Dim wdDoc As Word.Document 'Ref: Microsoft Word 11.0 Object Library'
Dim f As Integer 'Variable to store the file I/O number'
'File path for a temporary .rtf file'
Const strFileTemp = "C:\TempFile_ParseRTF.rtf"
'Obtain the next valid file I/O number'
f = FreeFile
'Open the temp file and save the RTF string in it'
Open strFileTemp For Output As #f
Print #f, strRTF
Close #f
'Open the .rtf file as a Word.Document'
Set wdDoc = GetObject(strFileTemp)
'Read the now parsed text from the Word.Document'
ParseRTF = wdDoc.Range.Text
'Delete the temporary .rtf file'
Kill strFileTemp
'Close the Word connection'
wdDoc.Close False
Set wdDoc = Nothing
End Function
您可以使用类似的内容为您的12,000个单元格中的每一个调用它:
Sub ParseAllRange()
Dim rngCell As Range
Dim strRTF As String
For Each rngCell In Range("A1:A12000")
'Parse the cell contents'
strRTF = ParseRTF(CStr(rngCell))
'Output to the cell one column over'
rngCell.Offset(0, 1) = strRTF
Next
End Sub
ParseRTF函数需要大约一秒的时间才能运行(至少在我的机器上),因此对于12,000个单元格,这将在大约三个半小时内完成。
在周末考虑过这个问题之后,我确信有一个更好(更快)的解决方案。
我记得剪贴板的RTF功能,并意识到可以创建一个类,将RTF数据复制到剪贴板,粘贴到word文档,然后输出生成的纯文本。这个解决方案的好处是不必为每个rtf字符串打开和关闭单词doc对象;它可以在循环之前打开并在之后关闭。
以下是实现此目的的代码。它是一个名为clsRTFParser的Class模块。
Private Declare Function GlobalAlloc Lib "kernel32" _
(ByVal wFlags&, ByVal dwBytes As Long) As Long
Private Declare Function GlobalLock Lib "kernel32" _
(ByVal hMem As Long) As Long
Private Declare Function GlobalUnlock Lib "kernel32" _
(ByVal hMem As Long) As Long
Private Declare Function lstrcpy Lib "kernel32" _
(ByVal lpString1 As Any, ByVal lpString2 As Any) As Long
Private Declare Function OpenClipboard Lib "user32" _
(ByVal Hwnd As Long) As Long
Private Declare Function EmptyClipboard Lib "user32" () As Long
Private Declare Function RegisterClipboardFormat Lib "user32" Alias _
"RegisterClipboardFormatA" (ByVal lpString As String) As Long
Private Declare Function SetClipboardData Lib "user32" _
(ByVal wFormat As Long, ByVal hMem As Long) As Long
Private Declare Function CloseClipboard Lib "user32" () As Long
'---'
Dim wdDoc As Word.Document 'Ref: Microsoft Word 11.0 Object Library'
Private Sub Class_Initialize()
Set wdDoc = New Word.Document
End Sub
Private Sub Class_Terminate()
wdDoc.Close False
Set wdDoc = Nothing
End Sub
'---'
Private Function CopyRTF(strCopyString As String) As Boolean
Dim hGlobalMemory As Long
Dim lpGlobalMemory As Long
Dim hClipMemory As Long
Dim lngFormatRTF As Long
'Allocate and copy string to memory'
hGlobalMemory = GlobalAlloc(&H42, Len(strCopyString) + 1)
lpGlobalMemory = GlobalLock(hGlobalMemory)
lpGlobalMemory = lstrcpy(lpGlobalMemory, strCopyString)
'Unlock the memory and then copy to the clipboard'
If GlobalUnlock(hGlobalMemory) = 0 Then
If OpenClipboard(0&) <> 0 Then
Call EmptyClipboard
'Save the data as Rich Text Format'
lngFormatRTF = RegisterClipboardFormat("Rich Text Format")
hClipMemory = SetClipboardData(lngFormatRTF, hGlobalMemory)
CopyRTF = CBool(CloseClipboard)
End If
End If
End Function
'---'
Private Function PasteRTF() As String
Dim strOutput As String
'Paste the clipboard data to the wdDoc and read the plain text result'
wdDoc.Range.Paste
strOutput = wdDoc.Range.Text
'Get rid of the new lines at the beginning and end of the document'
strOutput = Left(strOutput, Len(strOutput) - 2)
strOutput = Right(strOutput, Len(strOutput) - 2)
PasteRTF = strOutput
End Function
'---'
Public Function ParseRTF(strRTF As String) As String
If CopyRTF(strRTF) Then
ParseRTF = PasteRTF
Else
ParseRTF = "Error in copying to clipboard"
End If
End Function
您可以使用类似的内容为您的12,000个单元格中的每一个调用它:
Sub CopyParseAllRange()
Dim rngCell As Range
Dim strRTF As String
'Create new instance of clsRTFParser'
Dim RTFParser As clsRTFParser
Set RTFParser = New clsRTFParser
For Each rngCell In Range("A1:A12000")
'Parse the cell contents'
strRTF = RTFParser.ParseRTF(CStr(rngCell))
'Output to the cell one column over'
rngCell.Offset(0, 1) = strRTF
Next
End Sub
我在我的机器上使用示例RTF字符串模拟了这个。对于12,000个细胞,花了两分半钟,这是一个更合理的时间框架!
答案 1 :(得分:2)
您可以尝试使用正则表达式解析每个单元格,只留下您需要的内容。
每个RTF控制代码以“\”开头,以空格结尾,两者之间没有任何额外空格。 “{}”用于分组。如果您的文本不包含任何内容,则可以删除它们(“;”相同)。所以现在你继续使用你的文本和一些不必要的单词作为“Arial”,“Normal”等。你也可以构建字典来删除它们。经过一些调整后,您将只使用所需的文本。
请查看http://www.regular-expressions.info/以获取更多信息以及编写RegExp的绝佳工具(RegexBuddy - 遗憾的是它不是免费的,但它值得花钱.AFAIR还有试用版)。
更新:当然,我不鼓励你为每个细胞手动完成。只需迭代活动范围: 参考这个帖子: SO: About iterating through cells in VBA
就个人而言,我会尝试这个想法:
Sub Iterate()
For Each Cell in ActiveSheet.UsedRange.Cells
'Do something
Next
End Sub
如何在VBA(Excel)中使用RegExp?
参见: Regex functions in Excel 和 Regex in VBA
基本上你要通过COM使用VBScript.RegExp对象。
答案 2 :(得分:1)
此处的一些解决方案需要引用MS Word对象库。玩我所处理的卡片,我找到了一个不依赖它的解决方案。它在VBA中剥离了RTF标签以及其他类似绒毛的字体表和样式表。它可能对你有所帮助。我在你的数据中运行它,除了空白之外,我得到的输出与你预期的相同。
这是代码。
首先,检查字符串是否为字母数字。给它一个长度为一个字符的字符串。这个函数用于在这里和那里计算出界限。
Public Function Alphanumeric(Character As String) As Boolean
If InStr("ABCDEFGHIJKKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-", Character) Then
Alphanumeric = True
Else
Alphanumeric = False
End If
End Function
接下来是删除整个组。我用它来删除字体表和其他垃圾。
Public Function RemoveGroup(RTFString As String, GroupName As String) As String
Dim I As Integer
Dim J As Integer
Dim Count As Integer
I = InStr(RTFString, "{\" & GroupName)
' If the group was not found in the RTF string, then just return that string unchanged.
If I = 0 Then
RemoveGroup = RTFString
Exit Function
End If
' Otherwise, we will need to scan along, from the start of the group, until we find the end of the group.
' The group is delimited by { and }. Groups may be nested, so we need to count up if we encounter { and
' down if we encounter }. When that count reaches zero, then the end of the group has been found.
J = I
Do
If Mid(RTFString, J, 1) = "{" Then Count = Count + 1
If Mid(RTFString, J, 1) = "}" Then Count = Count - 1
J = J + 1
Loop While Count > 0
RemoveGroup = Replace(RTFString, Mid(RTFString, I, J - I), "")
End Function
好的,此函数会删除所有标签。
Public Function RemoveTags(RTFString As String) As String
Dim L As Long
Dim R As Long
L = 1
' Search to the end of the string.
While L < Len(RTFString)
' Append anything that's not a tag to the return value.
While Mid(RTFString, L, 1) <> "\" And L < Len(RTFString)
RemoveTags = RemoveTags & Mid(RTFString, L, 1)
L = L + 1
Wend
'Search to the end of the tag.
R = L + 1
While Alphanumeric(Mid(RTFString, R, 1)) And R < Len(RTFString)
R = R + 1
Wend
L = R
Wend
End Function
我们可以用明显的方式删除花括号:
Public Function RemoveBraces(RTFString As String) As String
RemoveBraces = Replace(RTFString, "{", "")
RemoveBraces = Replace(RemoveBraces, "}", "")
End Function
将上述功能复制粘贴到模块中后,您可以创建一个功能,使用它们去掉任何您不需要或不需要的东西。以下在我的案例中完美无缺。
Public Function RemoveTheFluff(RTFString As String) As String
RemoveTheFluff = Replace(RTFString, vbCrLf, "")
RemoveTheFluff = RemoveGroup(RemoveTheFluff, "fonttbl")
RemoveTheFluff = RemoveGroup(RemoveTheFluff, "colortbl")
RemoveTheFluff = RemoveGroup(RemoveTheFluff, "stylesheet")
RemoveTheFluff = RemoveTags(RemoveBraces(RemoveTheFluff))
End Function
我希望这会有所帮助。我不会在文字处理器或任何东西中使用它,但它可能会用于抓取数据,如果这就是你正在做的事情。
答案 3 :(得分:0)
您的帖子听起来好像每个RTF文档都存储在一个Excell单元格中。如果是,那么
Solution using .Net Framework RichTextBox control
将每个单元格中的RTF转换为2行代码中的纯文本(在稍微系统配置之后获取正确的.tlb文件以允许引用.Net Framework)。将单元格值放在 rtfsample 和
中Set miracle = New System_Windows_Forms.RichTextBox
With miracle
.RTF = rtfText
PlainText = .TEXT
End With