Get the bullet text of a paragraph

Date: 2017-05-18 06:52:29

Tags: c# ms-word xml-parsing openxml

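I have the following text in a Word document:

This is a paragraph:

1) This is the first bullet

2) This is the second bullet

I am trying to get the text "1)" and "2)", but without success: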

foreach (var items in para)
{
    int id = items.ParagraphProperties.NumberingProperties.NumberingId.Val;
    int refval = items.ParagraphProperties.NumberingProperties.NumberingLevelReference.Val;
    var runs = items.Descendants<Run>();
    foreach (var run in runs)
    {
        var txts = run.Descendants<Text>();

        foreach (var txt in txts)
        {

        }
    }
}

Accessing these values gives the following for both bullets:

claims.ParagraphProperties.NumberingProperties.NumberingId.Val
-> 2

claims.ParagraphProperties.NumberingProperties.NumberingLevelReference.Val
-> 0

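2 answers:

Answer 0: (score: 2)

I think I just got nerd-sniped by Dirk Vollmar, so now I had to try to come up with a way to calculate the "text" in an ordered list.

Now, this assumes that the English version of Word behaves roughly the same as my Danish version. Either way, after testing I found that there are 3 different indentation levels.

The first level is numbers, the second level is letters, and the third level is Roman numerals. After that the levels repeat, so the fourth level is a number again, and so on.

This means that, to calculate what the text in the list should be, we only need to know the paragraph's position within its indentation level.

Here is my solution. I am using this document for testing: test word document

After that, I wrote an extension method for Paragraph. It has no error handling, and it assumes that the paragraph you pass in is actually part of a list.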

public static string GetIndentionTextFromParagraph(this Paragraph paragraph)
{
    int numberingId = paragraph.ParagraphProperties.NumberingProperties.NumberingId.Val; 
    int numberingLevel = paragraph.ParagraphProperties.NumberingProperties.NumberingLevelReference.Val;
    //isolate paragraphs with the correct numbering id and indention level
    var paragraphsInList = paragraph.Parent.Descendants<Paragraph>().Where(p =>
        p.ParagraphProperties != null &&
        p.ParagraphProperties.NumberingProperties != null &&
        p.ParagraphProperties.NumberingProperties.NumberingId.Val == numberingId &&
        p.ParagraphProperties.NumberingProperties.NumberingLevelReference.Val == numberingLevel
        ).ToList();
    //find position of paragraph in list
    int paragraphPositionInLevelOfList = paragraphsInList.IndexOf(paragraph);
    //reduce the level to 0..2 (the formats repeat every three levels) so we can choose what kind of response to give
    numberingLevel = numberingLevel % 3;

    if (numberingLevel == 0)
    {
        //return a number
        return (paragraphPositionInLevelOfList + 1).ToString();
    }
    else if (numberingLevel == 1)
    {
        //return a letter (the string indexer throws once there are more than 26 items at this level)
        return "abcdefghijklmnopqrstuvwxyz"[paragraphPositionInLevelOfList].ToString();
    }
    else if (numberingLevel == 2)
    {
        //return roman
        return ToRoman(paragraphPositionInLevelOfList + 1);
    }
    else return "unknown list configuration";
}
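
One caveat worth hedging: the method above counts every paragraph in the parent that shares the numbering id and level. Assuming Word's default behaviour, where a nested level starts over whenever an item at a shallower level comes in between, a second nested run of letters would begin at "a" again. Below is a minimal sketch of restart-aware counting. It works on plain level indexes instead of Paragraph objects, and the ListPosition class and PositionInLevel method are hypothetical names, not part of the Open XML SDK.

```csharp
using System;
using System.Collections.Generic;

public static class ListPosition
{
    // levels: the indentation level of each list paragraph, in document order
    // (all paragraphs are assumed to share one numbering id).
    // Returns the zero-based position of levels[index] within its current run,
    // assuming the count restarts after any item at a shallower level.
    public static int PositionInLevel(IReadOnlyList<int> levels, int index)
    {
        int target = levels[index];
        int position = 0;
        for (int i = index - 1; i >= 0; i--)
        {
            if (levels[i] < target) break;        // a parent item begins a new run
            if (levels[i] == target) position++;  // an earlier sibling in the same run
        }
        return position;
    }
}
```

The result could be used in place of the IndexOf call, after collecting the level of every paragraph with the same numbering id.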

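Now all that remains is to test whether it works. How you isolate the paragraphs is up to you. To test it, I simply picked them out by some unique text.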

using (var wordDoc = WordprocessingDocument.Open(@"C:\test\qtest\test.docx", true))
{
    MainDocumentPart mainPart = wordDoc.MainDocumentPart;
    var document = mainPart.Document;

    Paragraph firstIndention = document.Descendants<Paragraph>().Where(i => i.InnerText.Contains("my number bullet 1")).First();
    Paragraph secondIndention = document.Descendants<Paragraph>().Where(i => i.InnerText.Contains("letter bullet 2")).First();
    Paragraph thirdIndention = document.Descendants<Paragraph>().Where(i => i.InnerText.Contains("third indention 2")).First();
    Paragraph fourthIndention = document.Descendants<Paragraph>().Where(i => i.InnerText.Contains("And we are back to numbering, so we know the rules now")).First();

    Console.WriteLine(firstIndention.GetIndentionTextFromParagraph());
    Console.WriteLine(secondIndention.GetIndentionTextFromParagraph());
    Console.WriteLine(thirdIndention.GetIndentionTextFromParagraph());
    Console.WriteLine(fourthIndention.GetIndentionTextFromParagraph());
}

This outputs: 1, b, II and 1.

Hope this helps.

I copied the "ToRoman" function from Converting integers to roman numerals:

static string ToRoman(int number)
{
    if ((number < 0) || (number > 3999)) throw new ArgumentOutOfRangeException(nameof(number), "insert value between 1 and 3999");
    if (number < 1) return string.Empty;
    if (number >= 1000) return "M" + ToRoman(number - 1000);
    if (number >= 900) return "CM" + ToRoman(number - 900); 
    if (number >= 500) return "D" + ToRoman(number - 500);
    if (number >= 400) return "CD" + ToRoman(number - 400);
    if (number >= 100) return "C" + ToRoman(number - 100);
    if (number >= 90) return "XC" + ToRoman(number - 90);
    if (number >= 50) return "L" + ToRoman(number - 50);
    if (number >= 40) return "XL" + ToRoman(number - 40);
    if (number >= 10) return "X" + ToRoman(number - 10);
    if (number >= 9) return "IX" + ToRoman(number - 9);
    if (number >= 5) return "V" + ToRoman(number - 5);
    if (number >= 4) return "IV" + ToRoman(number - 4);
    if (number >= 1) return "I" + ToRoman(number - 1);
    throw new ArgumentOutOfRangeException("something bad happened");
}

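Answer 1: (score: 1)

From your code I assume you are trying to get the list item text using the Open XML SDK (rather than Word interop).

If you unzip the document package and look at document.xml, you will see that the list item text is not stored in the document. It is a value the application computes when it opens the document. Unfortunately, there is no simple way to get this value with the Open XML SDK.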
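
To see this for yourself, here is a minimal sketch that reads document.xml with LINQ to XML instead of the Open XML SDK. It assumes the main part is stored at word/document.xml, which is the usual location but not guaranteed. The NumberingRefs class and its method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class NumberingRefs
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the numId, ilvl and run text of every paragraph that has a w:numPr.
    public static List<(string NumId, string Level, string Text)> Read(XDocument documentXml)
    {
        return documentXml.Descendants(W + "p")
            .Select(p => new { P = p, NumPr = p.Element(W + "pPr")?.Element(W + "numPr") })
            .Where(x => x.NumPr != null)
            .Select(x => (
                NumId: (string)x.NumPr.Element(W + "numId")?.Attribute(W + "val"),
                Level: (string)x.NumPr.Element(W + "ilvl")?.Attribute(W + "val"),
                Text: string.Concat(x.P.Descendants(W + "t").Select(t => t.Value))))
            .ToList();
    }

    // Loads the main document part straight from the .docx package (a zip archive).
    // Assumption: the part lives at word/document.xml.
    public static XDocument LoadDocumentXml(string docxPath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        using (Stream stream = zip.GetEntry("word/document.xml").Open())
        {
            return XDocument.Load(stream);
        }
    }
}
```

For a list paragraph, the returned Text holds only what was typed after the bullet. The "1)" itself never appears, only the numId and ilvl references.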

If you want to know the list item text, you basically have two options:

  1. Use Word interop, which gives you the value as computed by Word.
  2. Compute the value yourself from the Open XML. This takes some work, but the algorithm is documented in the MSDN article: Algorithm to Assemble List Item Text
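
As a rough illustration of the second option, the sketch below resolves a paragraph's numId and ilvl against numbering.xml (numId, then abstractNum, then lvl) and fills in the level's lvlText pattern. It is not the full algorithm from the article: it assumes a "decimal" level that only references its own counter, and it ignores level overrides, style links and multi-level patterns such as "%1.%2". NumberingLookup, FindLevel and DecimalLabel are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NumberingLookup
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Follows numId -> abstractNumId -> lvl and returns that level's settings.
    // Start is null when the level has no w:start element.
    public static (string Format, string LevelText, int? Start) FindLevel(
        XDocument numberingXml, string numId, string ilvl)
    {
        XElement num = numberingXml.Descendants(W + "num")
            .First(n => (string)n.Attribute(W + "numId") == numId);
        string abstractId = (string)num.Element(W + "abstractNumId").Attribute(W + "val");
        XElement lvl = numberingXml.Descendants(W + "abstractNum")
            .First(a => (string)a.Attribute(W + "abstractNumId") == abstractId)
            .Elements(W + "lvl")
            .First(l => (string)l.Attribute(W + "ilvl") == ilvl);
        return (
            (string)lvl.Element(W + "numFmt")?.Attribute(W + "val"),
            (string)lvl.Element(W + "lvlText")?.Attribute(W + "val"),
            (int?)lvl.Element(W + "start")?.Attribute(W + "val"));
    }

    // Simplification: only replaces the level's own placeholder (%1 for ilvl 0)
    // with a decimal number; position is the zero-based index within the level.
    public static string DecimalLabel(string levelText, int ilvl, int start, int position)
    {
        return levelText.Replace("%" + (ilvl + 1), (start + position).ToString());
    }
}
```

With a level whose lvlText is "%1)", this is where the closing parenthesis in "1)" and "2)" comes from. Letter and Roman formats would need their own conversion based on the returned Format value.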