Question

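Is there any way to convert a Microsoft Word document to a string without using the Microsoft COM components? I'm hoping there is some other way to deal with all of the excess markup.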

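Edit 12/13/13: We didn't want to reference the COM components because it would not work if the customer didn't have the exact same version of Office installed. Luckily, Microsoft has made the 2013 word.interop.dll backwards compatible, so we no longer have to worry about this restriction. After referencing the dll, we can do the following: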

using System;
using System.IO;
using Microsoft.Office.Interop.Word;

/// <summary>Gets the content of the word document</summary>
/// <param name="filePath">The path to the word document file</param>
/// <returns>The content of the document</returns>
public string ExtractText(string filePath)
{
    if (string.IsNullOrEmpty(filePath))
        throw new ArgumentNullException("filePath", "Input file path not specified.");

    // Pass the actual path (not the literal "filepath") so the exception reports the missing file.
    if (!File.Exists(filePath))
        throw new FileNotFoundException("Input file not found at specified path.", filePath);

    var resultText = string.Empty;
    Application wordApp = null;

    try
    {
        wordApp = new Application();
        // Third argument opens the document read-only.
        var doc = wordApp.Documents.Open(filePath, Type.Missing, true);
        if (doc != null)
        {
            if (doc.Content != null && !string.IsNullOrEmpty(doc.Content.Text))
                resultText = doc.Content.Text.Normalize();

            doc.Close();
        }
    }
    finally
    {
        // Always shut down the Word process, even if opening or reading failed.
        if (wordApp != null)
            wordApp.Quit(false, Type.Missing, false);
    }

    return resultText;
}

Answer 1

You will need to use a library to achieve what you want:

MS provides the OpenXML SDK V 2.0 (free, DOCX only)
Aspose.Words (commercial, DOC and DOCX)
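
For the DOCX case, the OpenXML SDK route could look roughly like the sketch below. It assumes the DocumentFormat.OpenXml assembly is referenced; the class name OpenXmlTextExtractor and the one-line-per-paragraph output are illustrative choices, not something the SDK prescribes.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper class, named here for illustration only.
public static class OpenXmlTextExtractor
{
    // Returns the body text of a .docx file, one line per paragraph.
    public static string ExtractText(string docxPath)
    {
        // Second argument false = open the package read-only.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;

            // InnerText concatenates the text of every run inside a paragraph.
            var lines = body.Descendants<Paragraph>().Select(p => p.InnerText);
            return string.Join(Environment.NewLine, lines);
        }
    }
}
```

This only reads the main document body; headers, footers, footnotes and comments live in separate parts and are not included.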

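If you have a lot of time on your hands, writing your own .DOC parser is conceivable; the .DOC specification can be found here.

BTW: MS does not support Office Interop in server-like scenarios (such as ASP.NET, Windows services, or similar); see http://support.microsoft.com/default.aspx?scid=kb;EN-US;q257757#kb2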

Answer 2

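Assuming you want to extract the text content of a doc file, there are a few command-line tools as well as commercial libraries. A rather old tool we once used for searching doc (not docx) files, in conjunction with the search engine sphider, is catdoc (also here). It is a DOS rather than a Windows tool, but it still worked for us as long as we met its prerequisite (file names in 8.3 format).

A commercial product is doc2txt, if you can afford 29 dollars.

For the newer docx format, you could use the Perl-based tool docx2txt.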

Of course, if you want to run those tools from C#, you will need to launch an external process; check here for a good explanation.
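
As a minimal sketch of launching such a tool from C#, the helper below starts a process and captures its standard output. The class and method names are made up for illustration, and it assumes the tool writes the extracted text to standard output, as catdoc does when given a file name.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper class, named here for illustration only.
public static class ExternalTool
{
    // Runs a command-line program and returns whatever it wrote to standard output.
    public static string RunAndCaptureOutput(string exePath, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = arguments,
            UseShellExecute = false,       // must be false to redirect streams
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read the output before waiting, so a full pipe buffer cannot block the child process.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
                throw new InvalidOperationException(exePath + " exited with code " + process.ExitCode);

            return output;
        }
    }
}
```

A call might look like ExternalTool.RunAndCaptureOutput("catdoc", "REPORT.DOC"), assuming catdoc is on the PATH and the file name satisfies the 8.3 restriction mentioned above. The text is decoded with the default console encoding, so documents with non-ASCII content may need StandardOutputEncoding set on the ProcessStartInfo.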

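A rather expensive but very powerful tool for accessing doc and docx content is Spire.doc, though it does far more than you need. It is more convenient to use because it is a .NET library.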

Answer 3

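If you mean the older DOC file format, then this is a big problem, because it is a binary file format specified by MS, and I must say I completely agree with RQDQ's comment.

However, if you mean the DOCX file format, then you can achieve this without MS COM components or any other third-party component, using pure .NET alone.

Check the following solutions: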

http://www.codeproject.com/Articles/20529/Using-DocxToText-to-Extract-Text-from-DOCX-Files
http://www.dotnetspark.com/kb/Content.aspx?id=5633
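
To illustrate why pure .NET is enough for DOCX: the file is a ZIP archive whose main text lives in word/document.xml. The sketch below is not the code from the linked articles; it is an independent minimal version that assumes .NET 4.5 or later (for System.IO.Compression.ZipFile, which on .NET Framework also needs a reference to System.IO.Compression.FileSystem). The class name DocxPlainText is hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper class, named here for illustration only.
public static class DocxPlainText
{
    // WordprocessingML namespace used by elements in word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text of a .docx file, one line per paragraph.
    public static string ExtractText(string docxPath)
    {
        using (ZipArchive archive = ZipFile.OpenRead(docxPath))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; not a DOCX file?");

            using (Stream stream = entry.Open())
            {
                XDocument xml = XDocument.Load(stream);

                // Each w:p element is a paragraph; its w:t descendants hold the visible text.
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));

                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }
}
```

This is deliberately simplified: tabs and manual line breaks are ignored, headers and footers are stored in other parts of the archive, and text inside text boxes nested in a paragraph would appear twice.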

How do I convert a .doc file to a string?