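I have an OCR program that converts images into a Word document. The Word document contains the text of all the images, and I want to split it into separate files, one per page.
Is there a way to do this in C#?
Thanks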
Answer 0 (score: 5)
Same as the other answer, but written as an iterator (returning IEnumerable<Range>) in an extension method on the document.
using System.Collections.Generic;
using Microsoft.Office.Interop.Word;

static class PagesExtension {
    //Yields one Range per page of the document, in page order
    public static IEnumerable<Range> Pages(this Document doc) {
        int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];
        //Start position of the page currently being built
        int pageStart = 0;
        for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++) {
            var page = doc.Range(pageStart);
            if (currentPageIndex < pageCount) {
                //page.GoTo returns a new Range object, leaving the page object unaffected
                page.End = page.GoTo(
                    What: WdGoToItem.wdGoToPage,
                    Which: WdGoToDirection.wdGoToAbsolute,
                    Count: currentPageIndex + 1
                ).Start - 1;
            } else {
                //There is no next page, so the last page runs to the end of the document
                page.End = doc.Range().End;
            }
            pageStart = page.End + 1;
            yield return page;
        }
    }
}
The main code then ends up like this:
static void Main(string[] args) {
    var app = new Application();
    app.Visible = true;
    var doc = app.Documents.Open(@"path\to\source\document");
    foreach (var page in doc.Pages()) {
        //Each page is pasted into its own new document
        page.Copy();
        var doc2 = app.Documents.Add();
        doc2.Range().Paste();
    }
}
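The loop above leaves each page in a new, unsaved document, while the question asks for separate files. A minimal sketch of the saving step follows, assuming Word 2010 or later (for SaveAs2) and the Pages() extension method from this answer. The SplitAndSave class, the SplitToFiles name, the outputDir parameter and the page-N.docx naming are illustrative assumptions, not part of the original answer.

```csharp
using System.IO;
using Microsoft.Office.Interop.Word;

static class SplitAndSave {
    // Hypothetical helper: writes each page of sourcePath to outputDir as page-N.docx.
    // Both paths should be absolute; Word resolves relative paths against its own folder.
    public static void SplitToFiles(string sourcePath, string outputDir) {
        var app = new Application();
        try {
            var doc = app.Documents.Open(sourcePath);
            int pageNumber = 1;
            foreach (var page in doc.Pages()) {   // Pages() is the extension method above
                page.Copy();
                var pageDoc = app.Documents.Add();
                pageDoc.Range().Paste();
                string target = Path.Combine(outputDir, "page-" + pageNumber + ".docx");
                // Save explicitly as .docx, then close the per-page document
                pageDoc.SaveAs2(target, WdSaveFormat.wdFormatXMLDocument);
                pageDoc.Close(WdSaveOptions.wdDoNotSaveChanges);
                pageNumber++;
            }
            doc.Close(WdSaveOptions.wdDoNotSaveChanges);
        } finally {
            // Shut down the Word instance this method started
            app.Quit();
        }
    }
}
```

Depending on the interop assembly version, the compiler may warn that Close and Quit are ambiguous between a method and an event of the same name; the calls still bind to the methods. A call might look like SplitAndSave.SplitToFiles(@"C:\docs\scan.docx", @"C:\docs\pages"), where both paths are placeholders.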
Answer 1 (score: 3)
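If Word is installed, you can use the Word object model to manipulate Word documents from C#.
First, add a reference to the Word object model. Right-click the project, then choose Add Reference... -> COM -> Microsoft Word 14.0 Object Library
(or something similar, depending on your version of Word).
Then you can use the following code: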
using Microsoft.Office.Interop.Word;
//for older versions of Word use:
//using Word;

namespace WordSplitter {
    class Program {
        static void Main(string[] args) {
            //Create a new instance of Word
            var app = new Application();

            //Show the Word instance.
            //If the code runs too slowly, you can show the application at the end of the program
            //Make sure it works properly first; otherwise, you'll get an error in a hidden window
            //(If it still runs too slowly, there are a few other ways to reduce screen updating)
            app.Visible = true;

            //We need a reference to the source document
            //It should be possible to get a reference to an open Word document, but I haven't tried it
            var doc = app.Documents.Open(@"path\to\file.doc");
            //(Can also use .docx)

            int pageCount = doc.Range().Information[WdInformation.wdNumberOfPagesInDocument];

            //We'll hold the start position of each page here
            int pageStart = 0;

            for (int currentPageIndex = 1; currentPageIndex <= pageCount; currentPageIndex++) {
                //This Range object will contain each page.
                var page = doc.Range(pageStart);

                //Generally, the end of the current page is 1 character before the start of the next.
                //However, we need to handle the last page: since there is no next page, the
                //GoTo method will move to the *start* of the last page.
                if (currentPageIndex < pageCount) {
                    //page.GoTo returns a new Range object, leaving the page object unaffected
                    page.End = page.GoTo(
                        What: WdGoToItem.wdGoToPage,
                        Which: WdGoToDirection.wdGoToAbsolute,
                        Count: currentPageIndex + 1
                    ).Start - 1;
                } else {
                    page.End = doc.Range().End;
                }
                pageStart = page.End + 1;

                //Copy and paste the contents of the Range into a new document
                page.Copy();
                var doc2 = app.Documents.Add();
                doc2.Range().Paste();
            }
        }
    }
}
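The comments in the code mention two things it does not show: attaching to a Word instance that is already open, and reducing screen updating. The following is a rough sketch of both, assuming .NET Framework (Marshal.GetActiveObject is not available on .NET Core or .NET 5+). The WordSession class and its two method names are hypothetical helpers, not part of the Word object model.

```csharp
using System;
using System.Runtime.InteropServices;
using Microsoft.Office.Interop.Word;

static class WordSession {
    // Hypothetical helper: reuse a running Word instance if one is registered,
    // otherwise start a new one. Assumes .NET Framework.
    public static Application GetOrStartWord() {
        try {
            return (Application)Marshal.GetActiveObject("Word.Application");
        } catch (COMException) {
            // No running instance was found
            return new Application();
        }
    }

    // Hypothetical helper: runs the given work with screen redrawing switched off,
    // then restores the previous setting even if the work throws.
    public static void WithoutScreenUpdating(Application app, Action work) {
        bool previous = app.ScreenUpdating;
        app.ScreenUpdating = false;
        try {
            work();
        } finally {
            app.ScreenUpdating = previous;
        }
    }
}
```

With an attached instance, app.ActiveDocument would give the document the user currently has open; it throws if no document is open, so the caller would need to handle that case. The page-splitting loop could then be passed to WithoutScreenUpdating as a lambda.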
Answer 2 (score: 0)
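It is not easy to tell where a page ends in a Word document, although documents created by Word do contain w:lastRenderedPageBreak elements.
It would be better to have your OCR program insert some kind of marker into the document between each block of converted text.
Then, depending on which type of Word document it is, process the file with the appropriate tool.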
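As an illustration of the marker idea, here is a minimal sketch that splits already-extracted plain text on a marker string. The marker value "<<<PAGE>>>", the MarkerSplitter class and the SplitOnMarker name are all assumptions made for the example; the actual marker would be whatever the OCR program can be configured to emit.

```csharp
using System;
using System.Collections.Generic;

static class MarkerSplitter {
    // Hypothetical helper: splits text wherever the marker occurs and returns
    // the non-empty chunks with surrounding whitespace trimmed.
    public static List<string> SplitOnMarker(string text, string marker) {
        var chunks = new List<string>();
        foreach (var piece in text.Split(new[] { marker }, StringSplitOptions.None)) {
            var trimmed = piece.Trim();
            if (trimmed.Length > 0) {
                chunks.Add(trimmed);
            }
        }
        return chunks;
    }
}
```

Each returned chunk could then be written to its own file. This works on plain text only, so any formatting in the Word document is lost; keeping formatting would mean splitting inside Word itself or in the document's XML.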