Question

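I have an OCR program that converts images into a Word document. The Word document contains the text of all the images, and I want to split it into separate files.

Is there a way to do this in C#?

Thanks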

Answer 1

Same as the other answer, but using IEnumerable (an iterator) and an extension method on Document.

using System.Collections.Generic;
using Microsoft.Office.Interop.Word;

static class PagesExtension {
    //Lazily yields one Range per page of the document
    public static IEnumerable<Range> Pages(this Document doc) {
        int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];
        //Start position of the current page
        int pageStart = 0;
        for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++) {
            var page = doc.Range(pageStart);
            if (currentPageIndex < pageCount) {
                //page.GoTo returns a new Range object, leaving the page object unaffected
                page.End = page.GoTo(
                    What: WdGoToItem.wdGoToPage,
                    Which: WdGoToDirection.wdGoToAbsolute,
                    Count: currentPageIndex + 1
                ).Start - 1;
            } else {
                //There is no next page, so the last page runs to the end of the document
                page.End = doc.Range().End;
            }
            pageStart = page.End + 1;
            yield return page;
        }
    }
}

The main code then ends up as follows:

static void Main(string[] args) {
    var app = new Application();
    app.Visible = true;
    var doc = app.Documents.Open(@"path\to\source\document");
    foreach (var page in doc.Pages()) {
        page.Copy();
        var doc2 = app.Documents.Add();
        doc2.Range().Paste();
    }
}
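
Note that the loop above leaves each page in a new, unsaved document. To end up with separate files, each new document still has to be saved and closed. A minimal sketch, assuming Word 2010 or later (for SaveAs2), the Pages() extension method above, and a hypothetical output folder and file-name pattern that are not part of the original answer:

```csharp
using System.IO;
using Microsoft.Office.Interop.Word;

static class SplitToFiles {
    static void Main(string[] args) {
        var app = new Application();
        var doc = app.Documents.Open(@"path\to\source\document");
        // Hypothetical output folder; replace with a full path that exists
        string outputDir = @"path\to\output\folder";
        int pageNumber = 1;
        try {
            foreach (var page in doc.Pages()) {
                page.Copy();
                var doc2 = app.Documents.Add();
                doc2.Range().Paste();
                // Hypothetical naming scheme: page1.docx, page2.docx, ...
                doc2.SaveAs2(Path.Combine(outputDir, "page" + pageNumber + ".docx"));
                doc2.Close();
                pageNumber++;
            }
        } finally {
            // Close the source without saving and shut down the Word instance
            doc.Close(WdSaveOptions.wdDoNotSaveChanges);
            app.Quit();
        }
    }
}
```

Closing each new document inside the loop keeps the number of open windows down, and the finally block is there so that a failure partway through does not leave a hidden Word process running.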

Answer 2

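If Word is installed, you can use the Word object model to manipulate Word documents from C#.

First, add a reference to the Word object model. Right-click the project, then Add Reference... -> COM -> Microsoft Word 14.0 Object Library (or something similar, depending on your version of Word).

Then you can use the following code: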

using Microsoft.Office.Interop.Word;
//for older versions of Word use:
//using Word;

namespace WordSplitter {
    class Program {
        static void Main(string[] args) {
            //Create a new instance of Word
            var app = new Application();

            //Show the Word instance.
            //If the code runs too slowly, you can show the application at the end of the program
            //Make sure it works properly first; otherwise, you'll get an error in a hidden window
            //(If it still runs too slowly, there are a few other ways to reduce screen updating)
            app.Visible = true;

            //We need a reference to the source document
            //It should be possible to get a reference to an open Word document, but I haven't tried it
            var doc = app.Documents.Open(@"path\to\file.doc");
            //(Can also use .docx)

            int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];

            //We'll hold the start position of each page here
            int pageStart = 0;

            for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++) {
                //This Range object will contain each page.
                var page = doc.Range(pageStart);

                //Generally, the end of the current page is 1 character before the start of the next.
                //However, we need to handle the last page -- since there is no next page, the 
                //GoTo method will move to the *start* of the last page.
                if (currentPageIndex < pageCount) {
                    //page.GoTo returns a new Range object, leaving the page object unaffected
                    page.End = page.GoTo(
                        What: WdGoToItem.wdGoToPage,
                        Which: WdGoToDirection.wdGoToAbsolute,
                        Count: currentPageIndex + 1
                    ).Start - 1;
                } else {
                    page.End = doc.Range().End;
                }
                pageStart = page.End + 1;

                //Copy and paste the contents of the Range into a new document
                page.Copy();
                var doc2 = app.Documents.Add();
                doc2.Range().Paste();
            }
        }
    }
}
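
The comments in the code mention two things it does not show: attaching to a Word document that is already open, and reducing screen updating. A rough sketch of both, assuming the project targets .NET Framework (where Marshal.GetActiveObject is available) and that Word is already running with the document open; this is an illustration rather than something tested as part of this answer:

```csharp
using System;
using System.Runtime.InteropServices;
using Microsoft.Office.Interop.Word;

static class AttachToWord {
    static void Main(string[] args) {
        // Assumption: a Word instance is already running.
        // GetActiveObject throws a COMException if none is found.
        var app = (Application)Marshal.GetActiveObject("Word.Application");

        // Assumption: the document to split is the active one
        var doc = app.ActiveDocument;

        // Suspend screen redrawing while the document is processed
        app.ScreenUpdating = false;
        try {
            int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];
            Console.WriteLine(doc.Name + ": " + pageCount + " page(s)");
            // ... the page-splitting loop from above would go here ...
        } finally {
            // Always restore redrawing, even if the processing fails
            app.ScreenUpdating = true;
        }
    }
}
```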

Reference: Word Object Model Overview on MSDN

Answer 3

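It is not easy to tell where a page ends in a Word document, even though Word writes w:lastRenderedPageBreak elements into the documents it creates. Those elements only record where the pages broke the last time Word rendered the document.

It would be better to have your OCR program insert some kind of marker into the document between each block of converted text.

Then, depending on which type of Word document it is, process the file with the appropriate tool.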
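
As a sketch of the marker idea: once the document text has been read out with whatever tool fits the file type, splitting it is plain string handling. The marker string and the class and method names below are hypothetical, chosen only for illustration:

```csharp
using System;
using System.Collections.Generic;

static class MarkerSplitter {
    // Hypothetical marker; use any string the OCR program can emit
    // and that cannot occur in the recognized text itself
    public const string Marker = "[[PAGE-BREAK]]";

    // Splits the text at each marker and drops empty pieces,
    // such as the one after a trailing marker
    public static List<string> SplitOnMarker(string text, string marker) {
        var parts = new List<string>();
        foreach (var part in text.Split(new[] { marker }, StringSplitOptions.None)) {
            var trimmed = part.Trim();
            if (trimmed.Length > 0) {
                parts.Add(trimmed);
            }
        }
        return parts;
    }
}
```

Each returned string can then be written to its own file. This only preserves plain text; keeping formatting would require splitting at the marker inside the document itself rather than in the extracted text.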

How to split the pages of a Word document into separate files in C#
