Extract text from html files and export to csv

Time: 2017-12-09 17:25:00

Tags: html csv export extract

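I have 5109 html files on my old website. I want to extract only the text from <title>Title 1</title> and from <span class="mtr_message"> Text exemple 1</span>, and export the results to a csv file like this: Title 1 in the first cell and Text exemple 1 in the second cell.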

1 answer:

Answer 0: (score: 0)

Try the WSH VBS code below. Paste in your paths, save it as a .vbs file, and run it.

Option Explicit

Dim sSourceFolder, sResultFile, sRes, oFile, sCont

sSourceFolder = "C:\Users\DELL\Desktop\tmp" ' source files folder path
sResultFile = "C:\Users\DELL\Desktop\tmp\result.csv" ' result csv file path
sRes = ""
With CreateObject("Scripting.FileSystemObject") 
    For Each oFile In .GetFolder(sSourceFolder).Files
        If LCase(.GetExtensionName(oFile.Name)) = "htm" And oFile.Size > 0 Then
            With .OpenTextFile(oFile.Path, 1, False, -2)
                If .AtEndOfStream Then sCont = "" Else sCont = .ReadAll
                .Close
            End With
            With CreateObject("VBScript.RegExp")
                .Global = True
                .IgnoreCase = True
                .Multiline = True
                .Pattern = "<title>(.*?)</title>[\s\S]*?<span class=""mtr_message"">(.*?)</span>"
                With .Execute(sCont)
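                    ' Quote both fields, double any embedded quotes, and put no space after the comma
                    If .Count = 1 Then sRes = sRes & """" & Replace(.Item(0).SubMatches(0), """", """""") & """,""" & Replace(.Item(0).SubMatches(1), """", """""") & """" & vbCrLf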
                End With
            End With
        End If
    Next
    With .OpenTextFile(sResultFile, 2, True, 0)
        .Write sRes
        .Close
    End With
End With
MsgBox "Completed"

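You may need to change the file extension and the encoding setting in the code. As written, it processes only files with the htm extension and reads them with the system default encoding. The encoding is selected by the last argument of .OpenTextFile(oFile.Path, 1, False, -2): -2 means system default, -1 means Unicode, and 0 means ASCII.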
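
OpenTextFile has no UTF-8 option, so if the pages are saved as UTF-8, one possible workaround is to read them through ADODB.Stream instead. The following is only a rough sketch and not part of the original script: the helper name ReadUtf8 is hypothetical, and it assumes the files really are UTF-8 and may carry either the htm or the html extension.

```vbscript
Option Explicit

' Hypothetical helper (not in the original answer):
' reads a whole file as UTF-8 text via ADODB.Stream.
Function ReadUtf8(sPath)
    With CreateObject("ADODB.Stream")
        .Type = 2            ' 2 = adTypeText
        .Charset = "utf-8"   ' assumption: the files are UTF-8
        .Open
        .LoadFromFile sPath
        ReadUtf8 = .ReadText ' with no argument, reads all remaining text
        .Close
    End With
End Function

Dim oFile, sCont
With CreateObject("Scripting.FileSystemObject")
    For Each oFile In .GetFolder("C:\Users\DELL\Desktop\tmp").Files
        ' Accept both extensions instead of htm only
        Select Case LCase(.GetExtensionName(oFile.Name))
            Case "htm", "html"
                If oFile.Size > 0 Then
                    sCont = ReadUtf8(oFile.Path)
                    ' ... run the same RegExp matching on sCont as in the script above
                End If
        End Select
    Next
End With
```

If this route is taken, the result file would probably need the same treatment, because .OpenTextFile(sResultFile, 2, True, 0) writes ASCII and can fail on characters outside that range.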