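Extracting content from an embedded paragraph in a docx file using POI

Time: 2016-11-16 07:43:28

Tags: java ms-word apache-poi docx

I am using POI to extract content from a docx file. When the file is processed, all of its pictures are lost. I inspected the file's XML and found an abnormal structure: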

<w:r>
<w:p xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
<w:r>
<w:drawing>
<wp:anchor distT="0" distB="0" distL="114300" distR="114300" simplePos="0" relativeHeight="251658240" behindDoc="0" locked="0" layoutInCell="1" allowOverlap="1">
<wp:simplePos x="0" y="0"/>
<wp:positionH relativeFrom="column">
<wp:align>center</wp:align>
</wp:positionH>
<wp:positionV relativeFrom="paragraph">
<wp:posOffset>2540</wp:posOffset>
</wp:positionV>
<wp:extent cx="5352176" cy="1837188"/>
<wp:wrapTopAndBottom/>
<wp:docPr id="9" name="media/GIUACAFYtDB.png"/>
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr>
<pic:cNvPr id="0" name="media/GIUACAFYtDB.png"/>
<pic:cNvPicPr/>
</pic:nvPicPr>
<pic:blipFill>
<a:blip r:embed="rId9"/>
<a:stretch>
<a:fillRect/>
</a:stretch>
</pic:blipFill>
<pic:spPr>
<a:xfrm>
<a:off x="0" y="0"/>
<a:ext cx="5352176" cy="1837188"/>
</a:xfrm>
<a:prstGeom prst="rect"/>
</pic:spPr>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:anchor>
</w:drawing>
</w:r>
</w:p>
</w:r>

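The paragraph element (w:p) sits inside a run element (w:r). I call this an embedded paragraph, and I cannot find a way to parse embedded paragraphs with POI. How can I handle this data?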
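
One way to confirm this structure independently of POI is to query the raw XML of word/document.xml with XPath. The following is only a minimal sketch using JDK classes: the class name NestedParagraphFinder and the method findEmbedIds are hypothetical names chosen for illustration, and it assumes the document XML has already been read into a string (for example from the word/document.xml entry of the docx zip). It collects the r:embed ids of every a:blip that lives in a w:p placed directly under a w:r.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI.
final class NestedParagraphFinder {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String A = "http://schemas.openxmlformats.org/drawingml/2006/main";
    static final String R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    /** Returns the r:embed ids of pictures found in paragraphs nested inside runs. */
    static List<String> findEmbedIds(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace-qualified XPath
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    if ("w".equals(prefix)) return W;
                    if ("a".equals(prefix)) return A;
                    return XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) { return null; }
                @Override public Iterator<String> getPrefixes(String uri) { return null; }
            });

            // w:r/w:p matches only paragraphs that are direct children of a run.
            NodeList blips = (NodeList) xpath.evaluate(
                    "//w:r/w:p//a:blip", doc, XPathConstants.NODESET);

            List<String> ids = new ArrayList<>();
            for (int i = 0; i < blips.getLength(); i++) {
                String id = ((Element) blips.item(i)).getAttributeNS(R, "embed");
                if (!id.isEmpty()) { // getAttributeNS returns "" when the attribute is absent
                    ids.add(id);
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }
}
```

Pictures in regular paragraphs (w:p containing w:r) are not matched by this expression, so a non-empty result indicates that the file contains the nested structure described above.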

1 answer:

Answer 0: (score: 0)

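// Imports assume POI 3.x; from POI 4.0 on, POIXMLDocumentPart lives in org.apache.poi.ooxml.
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.POIXMLDocumentPart;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFPicture;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.xmlbeans.SimpleValue;
import org.apache.xmlbeans.XmlObject;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;

public static List<XWPFPictureData> extractPictureData(XWPFRun wrun) {
    List<XWPFPictureData> result = new ArrayList<>();
    // Normal case: POI already recognized the pictures in this run.
    List<XWPFPicture> pictures = wrun.getEmbeddedPictures();
    if (pictures != null && !pictures.isEmpty()) {
        for (XWPFPicture picture : pictures) {
            XWPFPictureData data = picture.getPictureData();
            if (data != null) {
                result.add(data);
            }
        }
        return result;
    }
    // A schema-valid run without pictures has nothing more to offer.
    CTR ctr = wrun.getCTR();
    if (ctr.validate()) {
        return result;
    }
    // This run does not obey the OpenXML schema (e.g. a w:p nested in a w:r),
    // so search its raw XML for drawings instead.
    XWPFDocument document = wrun.getDocument();
    String drawingPath = "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
            + ".//w:drawing";
    String blipPath = "declare namespace a='http://schemas.openxmlformats.org/drawingml/2006/main' "
            + ".//a:blip";
    for (XmlObject drawing : ctr.selectPath(drawingPath)) {
        XmlObject[] blips = drawing.selectPath(blipPath);
        if (blips.length == 0) {
            continue;
        }
        // Only the first blip of each drawing is considered.
        XmlObject blipId = blips[0].selectAttribute(
                "http://schemas.openxmlformats.org/officeDocument/2006/relationships", "embed");
        if (blipId == null) {
            continue;
        }
        // Resolve the relationship id (e.g. rId9) to the picture part.
        String id = ((SimpleValue) blipId).getStringValue();
        POIXMLDocumentPart relatedPart = document.getRelationById(id);
        if (relatedPart instanceof XWPFPictureData) {
            result.add((XWPFPictureData) relatedPart);
        }
    }
    return result;
}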

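This does not solve every problem, but it solves mine for now. I tried to access the low-level XmlObject and construct an XWPFParagraph object for the embedded paragraph, but that failed. So I simply use the low-level XML processing code.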
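
For readers who want to see what the relationship lookup resolves to, the same low-level idea can be sketched without POI. In an OPC package, an id such as rId9 is mapped to a target in the relationships part (for the main document, word/_rels/document.xml.rels). The class RelationshipResolver and the method parseTargets below are hypothetical names for illustration, and the sketch assumes the .rels XML has already been read into a string.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI.
final class RelationshipResolver {
    static final String REL_NS =
            "http://schemas.openxmlformats.org/package/2006/relationships";

    /** Maps each relationship Id to its Target, in document order. */
    static Map<String, String> parseTargets(String relsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(relsXml)));

            NodeList rels = doc.getElementsByTagNameNS(REL_NS, "Relationship");
            Map<String, String> targets = new LinkedHashMap<>();
            for (int i = 0; i < rels.getLength(); i++) {
                Element rel = (Element) rels.item(i);
                // Id and Target are unqualified attributes of Relationship.
                targets.put(rel.getAttribute("Id"), rel.getAttribute("Target"));
            }
            return targets;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse relationships XML", e);
        }
    }
}
```

Targets are relative to the part that owns the relationships file, so a target like media/picture.png for the main document refers to word/media/picture.png inside the package.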