I am trying to extract all the text from PPT files and separate it using paragraphs and line breaks.
NPOIFSFileSystem poifs = new NPOIFSFileSystem(inputStream);
PowerPointExtractor extractor = new PowerPointExtractor(poifs);
StringBuilder SB = new StringBuilder();
// Read the extracted text line by line
BufferedReader bufReader = new BufferedReader(new StringReader(extractor.getText()));
String line = null;
while ((line = bufReader.readLine()) != null) {
    // Skip lines that are empty or very short
    if (line.trim().length() > 2) {
        line = line.replaceAll(" ", "<br />");
        // Collapse runs of whitespace into a single space
        line = line.replaceAll("\\s+", " ");
        line = Normalizer.normalize(line, Normalizer.Form.NFD);
        // Wrap each remaining line in its own paragraph
        SB.append("<p>").append(line).append("</p>\r\n");
    }
}
System.out.println(SB.toString());
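However, suppose a particular slide contains a table with multiple cells, as shown below:
With the code above, the output looks like this:
Is there a way to walk through each slide properly and then extract and separate the text by container?
Example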
<p>IBM IBM IBM MS IBM MS</p>
will become
<p>IBM</p><p>IBM</p><p>IBM</p><p>MS</p><p>IBM</p><p>MS</p>
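To make the desired result concrete, here is a minimal sketch of the output step only. It assumes the text of each container (each table cell or text box) has already been collected into its own string while walking the slides, rather than read from the single flattened string the extractor returns. The class name `ContainerParagraphs` and the helper `toParagraphs` are hypothetical names used only for illustration, not part of Apache POI.

```java
import java.util.Arrays;
import java.util.List;

class ContainerParagraphs {

    // Hypothetical helper: wraps the text of each container in its own <p> element.
    // Assumes the caller has already collected one string per table cell or text box.
    static String toParagraphs(List<String> containerTexts) {
        StringBuilder sb = new StringBuilder();
        for (String text : containerTexts) {
            if (text == null) {
                continue; // nothing to emit for a missing container
            }
            // Collapse runs of whitespace inside one container, then trim the ends
            String cleaned = text.replaceAll("\\s+", " ").trim();
            if (cleaned.isEmpty()) {
                continue; // skip empty cells
            }
            sb.append("<p>").append(cleaned).append("</p>");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The six cells from the example table, one string per cell
        List<String> cells = Arrays.asList("IBM", "IBM", "IBM", "MS", "IBM", "MS");
        // prints <p>IBM</p><p>IBM</p><p>IBM</p><p>MS</p><p>IBM</p><p>MS</p>
        System.out.println(toParagraphs(cells));
    }
}
```

The point of the sketch is that whitespace is collapsed only within a single container, so cell boundaries are never merged the way they are when the whole slide is treated as one line.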