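I am using Apache POI to extract content from a docx file. When the file is processed, all of the pictures are lost. I inspected the file and found an abnormal structure: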
<w:r>
  <w:p xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
    <w:r>
      <w:drawing>
        <wp:anchor distT="0" distB="0" distL="114300" distR="114300" simplePos="0" relativeHeight="251658240" behindDoc="0" locked="0" layoutInCell="1" allowOverlap="1">
          <wp:simplePos x="0" y="0"/>
          <wp:positionH relativeFrom="column">
            <wp:align>center</wp:align>
          </wp:positionH>
          <wp:positionV relativeFrom="paragraph">
            <wp:posOffset>2540</wp:posOffset>
          </wp:positionV>
          <wp:extent cx="5352176" cy="1837188"/>
          <wp:wrapTopAndBottom/>
          <wp:docPr id="9" name="media/GIUACAFYtDB.png"/>
          <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
            <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
              <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
                <pic:nvPicPr>
                  <pic:cNvPr id="0" name="media/GIUACAFYtDB.png"/>
                  <pic:cNvPicPr/>
                </pic:nvPicPr>
                <pic:blipFill>
                  <a:blip r:embed="rId9"/>
                  <a:stretch>
                    <a:fillRect/>
                  </a:stretch>
                </pic:blipFill>
                <pic:spPr>
                  <a:xfrm>
                    <a:off x="0" y="0"/>
                    <a:ext cx="5352176" cy="1837188"/>
                  </a:xfrm>
                  <a:prstGeom prst="rect"/>
                </pic:spPr>
              </pic:pic>
            </a:graphicData>
          </a:graphic>
        </wp:anchor>
      </w:drawing>
    </w:r>
  </w:p>
</w:r>
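The paragraph element sits inside a run element. I call this an embedded paragraph, and I cannot find a way to parse embedded paragraphs with POI. How should I handle this data?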
Answer 0 (score: 0)
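    import java.util.ArrayList;
    import java.util.List;

    // POI 4.x and later; in POI 3.x this class lives in org.apache.poi.
    import org.apache.poi.ooxml.POIXMLDocumentPart;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFPicture;
    import org.apache.poi.xwpf.usermodel.XWPFPictureData;
    import org.apache.poi.xwpf.usermodel.XWPFRun;
    import org.apache.xmlbeans.SimpleValue;
    import org.apache.xmlbeans.XmlObject;
    import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;

    public static List<XWPFPictureData> extractPictureData(XWPFRun wrun) {
        // First try the normal POI API: pictures that sit directly in the run.
        List<XWPFPicture> pictures = wrun.getEmbeddedPictures();
        List<XWPFPictureData> result = new ArrayList<>();
        if (pictures != null && !pictures.isEmpty()) {
            for (XWPFPicture picture : pictures) {
                XWPFPictureData data = picture.getPictureData();
                if (data != null) {
                    result.add(data);
                }
            }
            return result;
        }

        // A schema-valid run cannot contain a nested paragraph, so stop here.
        CTR ctr = wrun.getCTR();
        if (ctr.validate()) {
            return result;
        }

        // This run does not obey the OpenXML schema: search its raw XML
        // for drawings at any depth, including inside a nested <w:p>.
        XWPFDocument document = wrun.getDocument();
        String xpath = "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
                + ".//w:drawing";
        XmlObject[] drawings = ctr.selectPath(xpath);
        for (XmlObject drawing : drawings) {
            String blipPath = "declare namespace a='http://schemas.openxmlformats.org/drawingml/2006/main' "
                    + ".//a:blip";
            XmlObject[] blips = drawing.selectPath(blipPath);
            if (blips.length == 0) {
                continue;
            }
            // The r:embed attribute of the blip holds the relationship id of the image part.
            XmlObject blip = blips[0];
            XmlObject blipId = blip.selectAttribute(
                    "http://schemas.openxmlformats.org/officeDocument/2006/relationships", "embed");
            if (blipId == null) {
                continue;
            }
            String id = ((SimpleValue) blipId).getStringValue();
            // Resolve the relationship id to the picture part stored in the package.
            POIXMLDocumentPart relatedPart = document.getRelationById(id);
            if (relatedPart instanceof XWPFPictureData) {
                XWPFPictureData pictureData = (XWPFPictureData) relatedPart;
                result.add(pictureData);
            }
        }
        return result;
    }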
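This does not solve every problem, but it solves mine for now. I tried to access the low-level XmlObject and construct an XWPFParagraph object for the embedded paragraph, but failed. So I only use low-level XML processing code.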
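To check which relationship ids a suspicious fragment actually references, the same "search at any depth" idea can be tried outside POI. The following is only a minimal sketch using the JDK's DOM parser; the class name `BlipIdFinder`, the method `findEmbedIds`, and the shortened sample XML are made up for illustration and are not part of POI.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a POI class: lists the r:embed ids found in a fragment.
class BlipIdFinder {
    static final String A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main";
    static final String R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    /** Returns the r:embed value of every a:blip element, at any nesting depth. */
    static List<String> findEmbedIds(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for the namespace-based lookups below
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // getElementsByTagNameNS walks all descendants, so a paragraph nested
        // inside a run does not hide the blip.
        NodeList blips = doc.getElementsByTagNameNS(A_NS, "blip");
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < blips.getLength(); i++) {
            Element blip = (Element) blips.item(i);
            String id = blip.getAttributeNS(R_NS, "embed"); // "" when the attribute is absent
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-in for the malformed run from the question.
        String xml =
                "<w:r xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'"
                + " xmlns:a='" + A_NS + "' xmlns:r='" + R_NS + "'>"
                + "<w:p><w:r><w:drawing><a:blip r:embed='rId9'/></w:drawing></w:r></w:p>"
                + "</w:r>";
        System.out.println(findEmbedIds(xml)); // prints [rId9]
    }
}
```

Each id returned this way is what the POI code above passes to `document.getRelationById(id)` to reach the picture part.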