Quickly convert (.rtf | .doc) files to Markdown syntax with PHP

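Date: 2009-06-25 13:00:34

Tags: php automation markdown file-conversion .doc

I have been manually converting articles to Markdown syntax for a few days now, and it is getting rather tedious. Some of them are 3 or 4 pages long, with italics and other emphasized text. Is there a faster way to convert (.rtf | .doc) files to clean Markdown syntax that I can take advantage of?

7 answers: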

Answer 0 (score: 91)

If you happen to be on a Mac, textutil does a good job of converting doc, docx, and rtf to html, and pandoc does a good job of converting the resulting html to markdown:

$ textutil -convert html file.doc -stdout | pandoc -f html -t markdown -o file.md

I have a script that I threw together a while back that tries to use textutil, pdf2html, and pandoc to convert whatever I throw at it to markdown.
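
For a whole folder of articles, the one-liner above could be wrapped in a small batch helper. This is not the script mentioned in this answer, only a minimal Python sketch; the function names (`build_commands`, `convert`, `convert_folder`) are made up for illustration, and it assumes `textutil` and `pandoc` are on the PATH and accept the same flags as the command shown above.

```python
import subprocess
from pathlib import Path


def build_commands(src: Path) -> tuple[list[str], list[str]]:
    """Return the textutil and pandoc argument lists for one input file."""
    out = src.with_suffix(".md")
    # Same flags as the shell one-liner: HTML to stdout, then HTML -> Markdown.
    textutil = ["textutil", "-convert", "html", str(src), "-stdout"]
    pandoc = ["pandoc", "-f", "html", "-t", "markdown", "-o", str(out)]
    return textutil, pandoc


def convert(src: Path) -> Path:
    """Feed textutil's HTML output to pandoc and return the .md path."""
    textutil, pandoc = build_commands(src)
    html = subprocess.run(textutil, check=True, capture_output=True).stdout
    subprocess.run(pandoc, input=html, check=True)
    return src.with_suffix(".md")


def convert_folder(folder: Path) -> list[Path]:
    """Convert every .doc, .docx and .rtf file directly inside `folder`."""
    wanted = {".doc", ".docx", ".rtf"}
    return [convert(p) for p in sorted(folder.iterdir()) if p.suffix.lower() in wanted]
```

Calling `convert_folder(Path("articles"))` would write one `.md` file next to each source document; `articles` is a placeholder directory name.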

Answer 1 (score: 11)

ProgTips has a possible solution in the form of a Word macro (source download):

  

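A simple macro (source download) for automatically converting the most trivial things. This macro:

  • Replaces bold and italics
  • Replaces headings (styled as Heading 1-6)
  • Replaces numbered and bulleted lists

It is quite buggy, and I believe it hangs on larger files. However, I am not claiming it is a stable release anyway! :-) Experimental use only; recode and reuse it as you like, and post a comment if you have found a better solution.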

Source: ProgTips

Macro source

Installation

  
      
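  • Open WinWord
  • Press Alt + F11 to open the VBA editor
  • Right-click the first project in the project browser
  • Choose Insert -> Module
  • Paste the code from the file
  • Close the macro editor
  • Go to Tools > Macro > Macros and run the macro named MarkDown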

Source: ProgTips

Source

The macro source, saved here for safekeeping in case ProgTips removes the post or the site goes down:

'*** A simple MsWord->Markdown replacement macro by Kriss Rauhvargers, 2006.02.02.
'*** This tool does NOT implement all the markup specified in MarkDown definition by John Gruber, only
'*** the most simple things. These are:
'*** 1) Replaces all non-list paragraphs to ^p paragraph so MarkDown knows it is a stand-alone paragraph
'*** 2) Converts tables to text. In fact, tables get lost.
'*** 3) Adds a single indent to all indented paragraphs
'*** 4) Replaces all the text in italics to _text_
'*** 5) Replaces all the text in bold to **text**
'*** 6) Replaces Heading1-6 to #..#Heading (Heading numbering gets lost)
'*** 7) Replaces bulleted lists with ^p *  listitem ^p*  listitem2...
'*** 8) Replaces numbered lists with ^p 1. listitem ^p2.  listitem2...
'*** Feel free to use and redistribute this code
Sub MarkDown()
    Dim bReplace As Boolean
    Dim i As Integer
    Dim oPara As Paragraph


    'remove formatting from paragraph sign so that we dont get **blablabla^p** but rather **blablabla**^p
    Call RemoveBoldEnters


    For i = Selection.Document.Tables.Count To 1 Step -1
            Call Selection.Document.Tables(i).ConvertToText
    Next

    'simple text indent + extra paragraphs for non-numbered paragraphs
    For i = Selection.Document.Paragraphs.Count To 1 Step -1
        Set oPara = Selection.Document.Paragraphs(i)
        If oPara.Range.ListFormat.ListType = wdListNoNumbering Then
            If oPara.LeftIndent > 0 Then
                oPara.Range.InsertBefore (">")
            End If
            oPara.Range.InsertBefore (vbCrLf)
        End If


    Next

    'italic -> _italic_
    Selection.HomeKey Unit:=wdStory
    bReplace = ReplaceOneItalic  'first replacement
    While bReplace 'other replacements
        bReplace = ReplaceOneItalic
    Wend

    'bold-> **bold**
    Selection.HomeKey Unit:=wdStory
    bReplace = ReplaceOneBold 'first replacement
    While bReplace
        bReplace = ReplaceOneBold 'other replacements
    Wend



    'Heading -> ##heading
    For i = 1 To 6 'heading1 to heading6
        Selection.HomeKey Unit:=wdStory
        bReplace = ReplaceH(i) 'first replacement
        While bReplace
            bReplace = ReplaceH(i) 'other replacements
        Wend
    Next

    Call ReplaceLists


    Selection.HomeKey Unit:=wdStory
End Sub


'***************************************************************
' Function to replace bold text with **bold**
' Returns true if any occurrence was found, false otherwise
' (the caller repeats the call until it returns false)
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'***************************************************************
Function ReplaceOneBold() As Boolean
    Dim bReturn As Boolean

    Selection.Find.ClearFormatting
    With Selection.Find
        .Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Font.Bold = True
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With

    bReturn = False
    While Selection.Find.Execute = True
        bReturn = True
        Selection.Text = "**" & Selection.Text & "**"
        Selection.Font.Bold = False
        Selection.Find.Execute
    Wend

    ReplaceOneBold = bReturn
End Function

'*******************************************************************
' Function to replace italic text with _italic_
' Returns true if any occurrence was found, false otherwise
' (the caller repeats the call until it returns false)
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'********************************************************************
Function ReplaceOneItalic() As Boolean
    Dim bReturn As Boolean

        Selection.Find.ClearFormatting

    With Selection.Find
        .Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Font.Italic = True
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With

    bReturn = False
    While Selection.Find.Execute = True
        bReturn = True
        Selection.Text = "_" & Selection.Text & "_"
        Selection.Font.Italic = False
        Selection.Find.Execute
    Wend
    ReplaceOneItalic = bReturn
End Function

'*********************************************************************
' Function to replace headingX paragraphs with #heading
' Returns true if any occurrence was found, false otherwise
' Originally recorded by WinWord macro recorder, probably contains
' quite a lot of useless code
'*********************************************************************
Function ReplaceH(ByVal ipNumber As Integer) As Boolean
    Dim sReplacement As String
    Dim bReturn As Boolean

    Select Case ipNumber
    Case 1: sReplacement = "#"
    Case 2: sReplacement = "##"
    Case 3: sReplacement = "###"
    Case 4: sReplacement = "####"
    Case 5: sReplacement = "#####"
    Case 6: sReplacement = "######"
    End Select

    Selection.Find.ClearFormatting
    Selection.Find.Style = ActiveDocument.Styles("Heading " & ipNumber)
    With Selection.Find
        .Text = ""
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With


    bReturn = False
    While Selection.Find.Execute = True
        bReturn = True
        Selection.Range.InsertBefore (vbCrLf & sReplacement & " ")
        Selection.Style = ActiveDocument.Styles("Normal")
        Selection.Find.Execute
    Wend

    ReplaceH = bReturn
End Function



'***************************************************************
' A fix-up for paragraph marks that are bold or italic
'***************************************************************
Sub RemoveBoldEnters()
    Selection.HomeKey Unit:=wdStory
    Selection.Find.ClearFormatting
    Selection.Find.Font.Italic = True
    Selection.Find.Replacement.ClearFormatting
    Selection.Find.Replacement.Font.Bold = False
    Selection.Find.Replacement.Font.Italic = False
    With Selection.Find
        .Text = "^p"
        .Replacement.Text = "^p"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
    End With
    Selection.Find.Execute Replace:=wdReplaceAll

    Selection.HomeKey Unit:=wdStory
    Selection.Find.ClearFormatting
    Selection.Find.Font.Bold = True
    Selection.Find.Replacement.ClearFormatting
    Selection.Find.Replacement.Font.Bold = False
    Selection.Find.Replacement.Font.Italic = False
    With Selection.Find
        .Text = "^p"
        .Replacement.Text = "^p"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

'***************************************************************
' Replaces bulleted and numbered lists with Markdown list markers
' (* for bullets, "1." style prefixes for numbers), then removes
' Word's own list formatting
'***************************************************************
Sub ReplaceLists()
    Dim i As Integer
    Dim j As Integer
    Dim Para As Paragraph

    Selection.HomeKey Unit:=wdStory

    'iterate through all the lists in the document
    For i = Selection.Document.Lists.Count To 1 Step -1
        'check each paragraph in the list
        For j = Selection.Document.Lists(i).ListParagraphs.Count To 1 Step -1
            Set Para = Selection.Document.Lists(i).ListParagraphs(j)
            'if it's a bulleted list
            If Para.Range.ListFormat.ListType = wdListBullet Then
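                Para.Range.InsertBefore (ListIndent(Para.Range.ListFormat.ListLevelNumber, "*"))
            'if it's a numbered list (compare ListType against each constant;
            'Or-ing the constants together would make the test always true)
            ElseIf Para.Range.ListFormat.ListType = wdListSimpleNumbering Or _
                   Para.Range.ListFormat.ListType = wdListMixedNumbering Or _
                   Para.Range.ListFormat.ListType = wdListListNumOnly Then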
                Para.Range.InsertBefore (Para.Range.ListFormat.ListValue & ".  ")
            End If
        Next j
        'inserts paragraph marks before and after, removes the list itself
        Selection.Document.Lists(i).Range.InsertParagraphBefore
        Selection.Document.Lists(i).Range.InsertParagraphAfter
        Selection.Document.Lists(i).RemoveNumbers
    Next i
End Sub

'***********************************************************
' Returns the MarkDown indent text
'***********************************************************
Function ListIndent(ByVal ipNumber As Integer, ByVal spChar As String) As String
    Dim i  As Integer
    For i = 1 To ipNumber - 1
        ListIndent = ListIndent & "    "
    Next
    ListIndent = ListIndent & spChar & "    "
End Function

Source: ProgTips
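
To make the macro's conversion rules easier to follow outside of Word, here is a rough Python sketch of the same mapping. It does not talk to Word at all: the `Run` and `Paragraph` classes are a hypothetical in-memory model invented for this illustration, and only the output format (`_italic_`, `**bold**`, `#` headings, `*` bullets with four-space indents, `N.  ` numbers) is taken from the macro above.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    text: str
    bold: bool = False
    italic: bool = False


@dataclass
class Paragraph:
    runs: list = field(default_factory=list)
    heading: int = 0       # 1-6 for "Heading 1".."Heading 6", 0 otherwise
    list_type: str = ""    # "bullet", "number", or "" for a normal paragraph
    list_level: int = 1    # nesting depth, 1 = top level
    list_value: int = 1    # the number shown for a numbered item


def run_to_md(run: Run) -> str:
    """Wrap a run in the same markers the macro inserts."""
    text = run.text
    if run.italic:
        text = f"_{text}_"
    if run.bold:
        text = f"**{text}**"
    return text


def list_indent(level: int, marker: str) -> str:
    """Mirror ListIndent: four spaces per extra level, marker, four spaces."""
    return "    " * (level - 1) + marker + "    "


def paragraph_to_md(p: Paragraph) -> str:
    body = "".join(run_to_md(r) for r in p.runs)
    if p.heading:
        return "#" * p.heading + " " + body
    if p.list_type == "bullet":
        return list_indent(p.list_level, "*") + body
    if p.list_type == "number":
        return f"{p.list_value}.  {body}"
    return body
```

For example, `paragraph_to_md(Paragraph([Run("Hello "), Run("world", bold=True)], heading=2))` produces a level-2 heading with the second word in bold.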

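Answer 2 (score: 9)

If you are willing to use the .docx format, you can use this PHP script I put together, which extracts the XML, runs some XSL transformations and outputs a pretty decent Markdown equivalent: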

https://github.com/matb33/docx2md

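Note that it is meant to be run from the command line and is rather basic in its interface. However, it will get the job done!

If the script does not work well for you, I encourage you to send me your .docx files so that I can reproduce your problem and fix it. Log an issue on GitHub or contact me directly if you like.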
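
The general idea of pulling the XML out of a .docx can be sketched in a few lines. This is not how docx2md itself works: it is a Python illustration that walks the XML with ElementTree instead of using XSL, handles only bold, italic and headings, and assumes heading styles carry the IDs `Heading1` to `Heading6` (which can differ by Word language or template). The function names are made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def is_on(props: ET.Element, tag: str) -> bool:
    """True if a toggle such as <w:b/> is present and not switched off."""
    el = props.find(W + tag)
    return el is not None and el.get(W + "val", "true") not in ("0", "false", "off")


def run_to_md(run: ET.Element) -> str:
    text = "".join(t.text or "" for t in run.iter(W + "t"))
    props = run.find(W + "rPr")
    if not text or props is None:
        return text
    if is_on(props, "i"):
        text = f"_{text}_"
    if is_on(props, "b"):
        text = f"**{text}**"
    return text


def docx_to_markdown(path) -> str:
    """Read word/document.xml from the .docx zip and emit simple Markdown."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        line = "".join(run_to_md(r) for r in para.iter(W + "r"))
        style = para.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(W + "val", "") if style is not None else ""
        # Assumed style IDs: "Heading1".."Heading6".
        if style_id.startswith("Heading") and style_id[7:].isdigit():
            line = "#" * min(int(style_id[7:]), 6) + " " + line
        if line:
            paragraphs.append(line)
    return "\n\n".join(paragraphs)
```

Lists, tables, images and nested text boxes are ignored here, so this is only a starting point for simple documents.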

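Answer 3 (score: 7)

Pandoc is a good command-line conversion tool, but you will first need to get the input into a format that Pandoc can read, namely:

  • markdown
  • reStructuredText
  • textile
  • HTML
  • LaTeX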
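
A quick way to decide whether a file can go straight into pandoc or needs a pre-conversion step is to look at its extension. The sketch below covers only the formats listed in this answer; the reader names (`markdown`, `rst`, `textile`, `html`, `latex`) are pandoc's own `-f` values, while the extension mapping and the function names are assumptions made for this example. Newer pandoc releases read more formats than this list, so `pandoc --list-input-formats` on your own install is the authoritative answer.

```python
from pathlib import Path
from typing import Optional

# Extension -> pandoc reader name, limited to the formats listed above.
PANDOC_READERS = {
    ".md": "markdown",
    ".markdown": "markdown",
    ".rst": "rst",
    ".textile": "textile",
    ".html": "html",
    ".htm": "html",
    ".tex": "latex",
}


def pandoc_reader_for(path: str) -> Optional[str]:
    """Return the pandoc reader for this file, or None if it is not listed."""
    return PANDOC_READERS.get(Path(path).suffix.lower())


def needs_preconversion(path: str) -> bool:
    """True if the file must first be converted (for example to HTML)."""
    return pandoc_reader_for(path) is None
```

With this mapping, a `.doc` or `.rtf` file is reported as needing pre-conversion, while an `.html` file is not.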

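Answer 4 (score: 3)

We ran into the same problem of converting Word documents to markdown. Some were more complex and (very) large documents, with math equations, images and so on. So I made this script, which uses a number of different tools for the conversion: https://github.com/Versal/word2markdown

Because it chains several tools together, it is more error-prone, but it can be a good starting point if you have more complicated documents. Hope it helps! :)

Update: It currently only works on Mac OS X, and you need to have some requirements installed (Word, Pandoc, HTML Tidy, git, node/npm). For it to work properly, you also need to open an empty Word document and do: File -> Save as Web Page -> Compatibility -> Encoding -> UTF-8. This encoding is then saved as the default. See the README for details on how to set it up.

Then run this in the console: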

$ git clone git@github.com:Versal/word2markdown.git
$ cd word2markdown
$ npm install
(copy over the Word files, for example, "document.docx")
$ ./doc-to-md.sh document.docx document_files > document.md

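You can then find the Markdown in document.md and the images in the document_files directory.

It is perhaps a bit complex right now, so I welcome any contributions that make this easier or make it work on other operating systems! :)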
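
Since the steps above depend on several command-line tools, a small pre-flight check can save a failed run. This is a hedged Python sketch and not part of word2markdown: the executable names (`pandoc`, `tidy`, `git`, `node`, `npm`) are assumed from a typical install of the requirements listed above, and Word itself is not checked because it is a GUI application.

```python
import shutil

# Assumed executable names for Pandoc, HTML Tidy, git and node/npm.
REQUIRED_TOOLS = ["pandoc", "tidy", "git", "node", "npm"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All command-line requirements found.")
```

The `which` parameter exists only so the lookup can be swapped out in tests.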

Answer 5 (score: 1)

Have you tried this one? I am not sure how feature-rich it is, but it works for simple text. http://markitdown.medusis.com/

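Answer 6 (score: 0)

As part of a university Ruby course, I developed a tool that converts OpenOffice Writer files (.odt) to markdown. A lot of assumptions have to be made to convert a document into the correct format. For example, it is hard to determine the size of text that must be treated as a heading. However, the only thing you can lose in this conversion is formatting: any text that is encountered is always appended to the markdown document. The tool I developed supports lists, bold and italic text, and it has syntax for tables.

http://github.com/bostko/doc2text Give it a try and please give me your feedback.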
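
The heading problem mentioned above can be approached with a simple heuristic. The sketch below is not taken from doc2text; it is a Python illustration that assumes the most frequent font size in a document is body text and that every distinct larger size is a heading, with the largest size becoming level 1. The function name and its input (one font size per paragraph) are made up for this example.

```python
from collections import Counter


def heading_levels(sizes, max_levels=6):
    """Map font sizes larger than the body size to Markdown heading levels.

    `sizes` holds one font size per paragraph. The most common size is
    assumed to be body text; larger sizes are ranked from biggest (level 1)
    downwards, and at most `max_levels` sizes are mapped.
    """
    if not sizes:
        return {}
    body = Counter(sizes).most_common(1)[0][0]
    larger = sorted({s for s in sizes if s > body}, reverse=True)
    return {size: level for level, size in enumerate(larger[:max_levels], start=1)}
```

A paragraph whose size appears in the returned mapping would then be prefixed with that many `#` characters. Documents that use bold text of body size as headings would defeat this heuristic.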