Question

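So, I have an Office Word document. I need to go through the document and get all of its "words", and then put all of those "words" into a List object. That is the first part of my problem.

Part 2: is there a resource-efficient way to compare two "words" to see whether they match? I found this .dll, but I don't know whether it is the right tool: http://diffplex.codeplex.com/

I found some code in another question here, but it relies on Office being installed on my server, and this is a web application.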

Answer 1

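I'll offer an answer, assuming I have understood your question well enough. As I read it:

Part 1: Read the Word document.

Part 2: Compare Word documents, which I interpret as comparing the strings parsed from each document.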

//Part 1
var applicationWord = new Microsoft.Office.Interop.Word.Application();
object filename = @"C:\Users\Omri\Desktop\Test.docx";
object missing = System.Reflection.Missing.Value;
Microsoft.Office.Interop.Word._Document oDoc;

oDoc = applicationWord.Documents.Open(filename, ref missing, ref missing, ref missing,
                 ref missing, ref missing, ref missing, ref missing,
                 ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
                 ref missing, ref missing);

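This just opens the Word document. You will probably need Office installed on your server, which, as I understand it, you are fine with.

Now for parsing the strings: I've provided a regex that strips all non-alphanumeric characters from the string, which is my working assumption. You can change it to suit your needs. I also tested it with pictures and the like, and it removes them.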

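var wordsList = new List<string>();
// StoryRanges is a non-generic collection, so the element type must be stated explicitly
foreach (Microsoft.Office.Interop.Word.Range range in oDoc.StoryRanges)
{
      var tempString = range.Text;
      // Replace everything except letters, digits, spaces and hyphens with a space,
      // so that words separated only by a paragraph mark are not fused together
      tempString = Regex.Replace(tempString, @"[^a-zA-Z0-9 -]", " ");
      wordsList.AddRange(tempString.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
}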

This gives you a list of the strings in the Word document.

//Part 2
bool wordMatch = wordsList.Count == wordsList2.Count; //quick initial check;
//you can check for wordMatch == false and return immediately

wordsList.Sort(); //sorting in lexicographical order so will be easy to compare
wordsList2.Sort(); //this is a 2nd word document
//the loop only runs while the lists still match, so a length mismatch never indexes out of range
for (int i = 0; wordMatch && i < wordsList.Count; i++)
{
       if (wordsList[i] != wordsList2[i])
       {
           wordMatch = false;
       }
}

I think that's it; I hope it helps.

Answer 2

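OK, given the limited information and the lack of clarification, maybe you can do this:

First get the Open XML SDK 2.5 (2.5 MB). (You don't actually need to install it on the server; you only need to reference one .dll -> DocumentFormat.OpenXml.dll.)

Now you can write the following: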

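// This will extract the contents of a word doc and place them in an IEnumerable
// covering your first problem
// (needs: using System.IO; using DocumentFormat.OpenXml.Packaging;)
private IEnumerable<string> ExtractContents(string filePath)
{
    // Open read-only: nothing is modified, so no write access is needed
    using (var stream = File.Open(filePath, FileMode.Open, FileAccess.Read))
    using (var document = WordprocessingDocument.Open(stream, false))
    {
        // Note: InnerText concatenates paragraphs without inserting a separator
        string content = document.MainDocumentPart.Document.Body.InnerText;
        return content.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries);
    }
}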

Then you can compare the two lists:

IEnumerable<string> doc1 = ExtractContents(@"C:\doc1.docx");
IEnumerable<string> doc2 = ExtractContents(@"C:\doc2.docx");

if (doc1.SequenceEqual(doc2))
{
    // same content in both !
}
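
One caveat worth checking before relying on the comparison above: `SequenceEqual` compares element by element, so it is sensitive to both word order and letter case. The following is only a sketch of an order-insensitive alternative, assuming that "same content" means "same words, the same number of times". The class name `WordComparison` and the helper `SameWords` are hypothetical and are not part of the answer or of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordComparison
{
    // Hypothetical helper: true when both sequences contain the same words
    // the same number of times, ignoring order and letter case.
    public static bool SameWords(IEnumerable<string> a, IEnumerable<string> b)
    {
        var comparer = StringComparer.OrdinalIgnoreCase;
        var sortedA = a.OrderBy(w => w, comparer);
        var sortedB = b.OrderBy(w => w, comparer);
        return sortedA.SequenceEqual(sortedB, comparer);
    }

    public static void Main()
    {
        var doc1 = new[] { "the", "quick", "Fox" };
        var doc2 = new[] { "fox", "the", "quick" };

        Console.WriteLine(doc1.SequenceEqual(doc2)); // order- and case-sensitive comparison
        Console.WriteLine(SameWords(doc1, doc2));    // order- and case-insensitive comparison
    }
}
```

Whether reordered text should count as "the same" depends on the use case; if word order matters, the plain `SequenceEqual` call is the better fit.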

I hope I haven't misunderstood you.

Answer 3

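Unfortunately, I can only offer a partial answer here, because the problem I couldn't solve needs someone smarter than me :-) Perhaps you will find just the "Problems" section helpful, since you may want to look into it before you start implementing a solution.

Use case

I was wondering what use case might be behind your question, in order to understand it better. I'm thinking of the following: a user (perhaps a professor) uploads two Word documents (perhaps thesis documents) to a website to determine whether one is a copy of the other. You want to compare via word lists because the documents may contain the same content in a different order (the student who wrote the thesis has reordered the sentences). The website user probably wants a different view of the document content than the one built into Microsoft Word itself, because a) otherwise the user could compare the documents directly in Word and would not need the website, and b) the comparison in Word is useless if sentences have been reordered.

So, what might be useful? As you already mentioned, we can extract a list of words from the Word document. First, we count the words so that we can later express the difference as a percentage. Then we compare the two lists and find the differences. We still end up with two lists: 1) words that are in document A but not in document B, and 2) vice versa. Displaying these two lists may already be interesting, but it seems you want to go a step further: show the Word document as a whole and highlight the differences.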
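
The counting and percentage step described above could look roughly like the following. This is a minimal sketch under assumptions: the words are already extracted, a repeated word only cancels against one occurrence in the other document, and the percentage is taken relative to each document's own word count. The class `WordStats` and the helper `OnlyIn` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordStats
{
    // Hypothetical helper: words of 'source' left over after each word of
    // 'other' has cancelled out one equal occurrence (a multiset difference).
    public static List<string> OnlyIn(IEnumerable<string> source, IEnumerable<string> other)
    {
        var remaining = other.GroupBy(w => w).ToDictionary(g => g.Key, g => g.Count());
        var result = new List<string>();
        foreach (var word in source)
        {
            int count;
            if (remaining.TryGetValue(word, out count) && count > 0)
                remaining[word] = count - 1; // matched one occurrence in 'other'
            else
                result.Add(word);            // no partner left in 'other'
        }
        return result;
    }

    public static void Main()
    {
        var wordsA = "one two three two".Split(' ');
        var wordsB = "two three four".Split(' ');

        var onlyInA = OnlyIn(wordsA, wordsB);
        var onlyInB = OnlyIn(wordsB, wordsA);

        // Share of each document's words that have no partner in the other one.
        double percentA = 100.0 * onlyInA.Count / wordsA.Length;
        double percentB = 100.0 * onlyInB.Count / wordsB.Length;

        Console.WriteLine("Only in A: " + string.Join(", ", onlyInA) + " (" + percentA.ToString("F1") + "%)");
        Console.WriteLine("Only in B: " + string.Join(", ", onlyInB) + " (" + percentB.ToString("F1") + "%)");
    }
}
```

The dictionary keeps the work roughly linear in the number of words, which may matter for long documents on a web server.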

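Problems

Reading the Word document. This should be easy. I suggest Codeplex DocX here, because I have used it before and it does not require Office to be installed. However, it only works with the .docx file format.
Extracting the word list. This is a bit tricky, because we need to consider word separators, which may be more than just spaces.
Reducing the lists to their differences. This should be easy again.
Displaying the differences. This is probably the trickiest part if you want to do it right. You have already suggested a comparison component. However, I fear that such a component uses its own built-in mechanism to find the differences, which is exactly what I don't want given my use case. But there is an even worse problem: highlighting the differing words at the right positions. I need an example to describe it: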

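Word document A

This is a sentence copied by the student. Then it contains something else here.

Word document B

This is a sentence copied by the student. Then it contains a second sentence.

Words in document A that are not in document B: something, else, here.

Words in document B that are not in document A: a, second, sentence.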

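Now let's do the highlighting. In document A this is no problem: just highlight the words. In document B, however, how do we know whether to highlight the words "a" and "sentence" in the first sentence or in the second?

Of course, to a human it is obvious that the difference lies in the second sentence when the first sentences are identical. This kind of thing has probably been solved (or at least attempted) by all the diff tools out there. I just wanted to point out that, depending on the use case, it may be important not to break the Word document down into a simple list of words.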
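
To illustrate how a position-aware comparison could resolve that ambiguity, here is a rough sketch based on a longest common subsequence (LCS), the classic technique behind many diff tools. It is not the author's solution and not the algorithm of any particular library; the class `WordDiff` and the method `UnmatchedInB` are hypothetical names. Note that this approach deliberately respects word order, so it would not treat reordered sentences as equal.

```csharp
using System;
using System.Collections.Generic;

public static class WordDiff
{
    // For each word in b, reports whether it is unmatched against a when both
    // lists are aligned in order via a longest common subsequence (LCS).
    public static bool[] UnmatchedInB(IList<string> a, IList<string> b)
    {
        // lcs[i, j] = length of the LCS of a[i..] and b[j..]
        var lcs = new int[a.Count + 1, b.Count + 1];
        for (int i = a.Count - 1; i >= 0; i--)
            for (int j = b.Count - 1; j >= 0; j--)
                lcs[i, j] = a[i] == b[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);

        var unmatched = new bool[b.Count];
        int x = 0, y = 0;
        while (x < a.Count && y < b.Count)
        {
            if (a[x] == b[y]) { x++; y++; }                 // word is common to both
            else if (lcs[x + 1, y] >= lcs[x, y + 1]) x++;    // skip a word only in a
            else { unmatched[y] = true; y++; }               // word only in b
        }
        for (; y < b.Count; y++) unmatched[y] = true;       // trailing words only in b
        return unmatched;
    }

    public static void Main()
    {
        var a = "This is a sentence copied by the student Then it contains something else here".Split(' ');
        var b = "This is a sentence copied by the student Then it contains a second sentence".Split(' ');
        var flags = UnmatchedInB(a, b);
        // Print document B with the unmatched words in brackets.
        for (int i = 0; i < b.Length; i++)
            Console.Write(flags[i] ? "[" + b[i] + "] " : b[i] + " ");
        Console.WriteLine();
    }
}
```

With the two example documents, the bracketed words are the "a", "second" and "sentence" of the second sentence, while the "a" and "sentence" of the first sentence stay unmarked. The table needs memory proportional to the product of both word counts, so very long documents may call for a more economical diff algorithm.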

My solution

Reading the Word document:

using Novacode;
// mydocumentPath is assumed to hold the folder that contains the documents
var doc = DocX.Load(Path.Combine(mydocumentPath, @"DocumentA.docx"));

Extracting the word list:

private static IList<string> ExtractWords(DocX doc)
{
    // Get all the text without the headlines
    var text = doc.Text;
    // Maybe you want to tweak this and add more items or use a different approach,
    // e.g. using Regex like proposed by @Omri Aharon
    var separators = "\r\n\"”“„?«» .!?,{}[]-".ToCharArray();
    var strings = text.Split(separators, StringSplitOptions.RemoveEmptyEntries);
    return strings.ToList();
}
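
As an alternative to maintaining a list of separator characters, a regular expression can define what counts as a word instead. The following is only a sketch, assuming that a word is any run of Unicode letters or digits; the class `Tokenizer` is a hypothetical name, and this `ExtractWords` takes a plain string rather than a `DocX` object so that it runs without the DocX library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Tokenizer
{
    // Hypothetical helper: treats every run of Unicode letters or digits as one
    // word, so any punctuation or whitespace acts as a separator.
    public static List<string> ExtractWords(string text)
    {
        return Regex.Matches(text, @"[\p{L}\p{N}]+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        var words = ExtractWords("Hello, world! \"Quoted\" text (2nd try); done.");
        Console.WriteLine(string.Join("|", words)); // prints "Hello|world|Quoted|text|2nd|try|done"
    }
}
```

One trade-off of this pattern: apostrophes and hyphens also act as separators, so "don't" becomes two tokens. Add those characters to the character class if that is not wanted.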

Reducing the lists to their differences:

    private static void ReduceToDifferences(IList<string> wordsA, IList<string> wordsB)
    {
        // Maybe there's some optimization possible, 
        // e.g. loop over the list with fewest entries or something
        for (int i = wordsA.Count - 1; i >= 0; i--)
        {
            // Find word of list A in list B
            var word = wordsA[i];
            var index = wordsB.IndexOf(word);

            // If found, remove it in both
            if (index > -1)
            {
                wordsB.RemoveAt(index);
                wordsA.RemoveAt(i);
            }
        }
    }

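Displaying the differences: as mentioned above, this is only a partial answer, and because of the problem described above I have no solution for this part.

Short, Self Contained, Correct Example

For completeness, and to provide an SSCCE, here is the rest of my application to make it work.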

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Novacode;

static void Main(string[] args)
{
    var wordsA = ExtractWords(DocX.Load(Path.Combine(mydocumentPath, @"DocumentA.docx")));
    var wordsB = ExtractWords(DocX.Load(Path.Combine(mydocumentPath, @"DocumentB.docx")));
    ReduceToDifferences(wordsA, wordsB);

    Console.WriteLine("------ Words only in A:");
    PrintList(wordsA);
    Console.WriteLine("------ Words only in B:");
    PrintList(wordsB);
    Console.WriteLine("------ Press any key");
    Console.ReadLine();
}

private static void PrintList(IList<string> wordsA)
{
    foreach (var word in wordsA)
    {
        Console.WriteLine(word);
    }
}

MVC4 C#: take an Office Word document and convert it to a list
