How to extract numbering and text from a .docx file

Asked: 2016-06-28 05:08:37

Tags: java apache-poi extract docx

How can I extract both the numbering and the text from a .docx file using Java and the Apache POI XWPF library?

I am using the following code:

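import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public static void readDocxFile() {
    // try-with-resources closes the stream and the document once, after the loop
    // (the original called fis.close() inside the loop on every iteration).
    try (FileInputStream fis = new FileInputStream("C:\\test.docx");
         XWPFDocument document = new XWPFDocument(fis)) {
        List<XWPFParagraph> paragraphs = document.getParagraphs();

        for (XWPFParagraph para : paragraphs) {
            // getText() returns only the run text; list numbers are not stored in it.
            System.out.println(para.getText());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}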

My code extracts only the text, like this:

CLIENT SERVICE SATISFACTION
Client Feedback System
Interlibrary Loans
Shelf Tidiness
Three Day Loans
Materials Availability Survey
Online help service

I need to extract the section numbers (the numbering) together with the text, like this:

1    CLIENT SERVICE SATISFACTION
1.1   Client Feedback System
1.1.1 Interlibrary Loans
1.1.2 Shelf Tidiness
1.1.3 Three Day Loans
1.2   Materials Availability Survey
1.3   Online help service

1 Answer:

Answer 0: (score: 0)

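To get the text of the .docx file, use the methods of XWPFParagraph (from the poi-ooxml API). To get the numbering definition of a paragraph, try the following code, where currentPara_Line is an XWPFParagraph: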

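// Note: getPPr() and getNumPr() return null for paragraphs without direct numbering, so check before chaining.
BigInteger currentParagraphNumberingID = currentPara_Line.getCTP().getPPr().getNumPr().getNumId().getVal();
// Map the concrete numbering ID to the ID of its abstract numbering definition.
BigInteger currentParagraphAbstractNumID2 = currentPara_Line.getDocument().getNumbering().getAbstractNumID(currentParagraphNumberingID);
XWPFAbstractNum currentParagraphAbstractNum = currentPara_Line.getDocument().getNumbering().getAbstractNum(currentParagraphAbstractNumID2);
// The underlying CTAbstractNum holds the per-level formatting (number format, start value, level text).
CTAbstractNum currentParagraphAbstractNumFormatting = currentParagraphAbstractNum.getCTAbstractNum();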
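
The code above only reaches the numbering definition; it does not yield the rendered label such as "1.1.2", because the .docx file stores the list level of each paragraph rather than the displayed number. One way to rebuild the labels is to keep a counter per level while walking the paragraphs in order. The following is a minimal sketch of that counting step in plain Java, not a POI API: the class OutlineCounter and its next method are hypothetical names introduced here. It assumes a single list, decimal numbering that starts at 1 on every level, and dot-separated labels; a real document defines these per level in its CTAbstractNum, and the sketch does not read them. It also assumes that a skipped intermediate level counts as 1.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Apache POI): one counter per outline level. */
class OutlineCounter {
    // counters.get(i) is the current number at zero-based level i.
    private final List<Integer> counters = new ArrayList<>();

    /** Advances the counter at the given zero-based level and returns a label like "1.1.2". */
    String next(int level) {
        if (level < 0) {
            throw new IllegalArgumentException("level must be >= 0");
        }
        // Deeper levels restart whenever a shallower level advances.
        while (counters.size() > level + 1) {
            counters.remove(counters.size() - 1);
        }
        // Assumption: a skipped intermediate level is treated as 1.
        while (counters.size() < level) {
            counters.add(1);
        }
        if (counters.size() == level) {
            counters.add(1);                                  // first item at this level
        } else {
            counters.set(level, counters.get(level) + 1);     // next sibling at this level
        }
        StringBuilder label = new StringBuilder();
        for (int i = 0; i < counters.size(); i++) {
            if (i > 0) {
                label.append('.');
            }
            label.append(counters.get(i));
        }
        return label.toString();
    }

    public static void main(String[] args) {
        // Levels matching the outline in the question: 1, 1.1, 1.1.1, 1.1.2, 1.1.3, 1.2, 1.3
        int[] levels = {0, 1, 2, 2, 2, 1, 1};
        OutlineCounter counter = new OutlineCounter();
        for (int level : levels) {
            System.out.println(counter.next(level));
        }
    }
}
```

With POI, the level passed to next would come from XWPFParagraph.getNumIlvl(), and XWPFParagraph.getNumID() tells whether a paragraph is numbered at all; both return null for unnumbered paragraphs, which should simply be printed without a label. Documents with several lists would need one counter object per numbering ID.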