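I am trying to extract the content of a large dataset containing mixed file types (pdf, doc, ppt).
I am using tika-app-1.12.jar. When I run my code, everything works perfectly at first, and then I get this error: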
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3ea25501
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:258)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at recruitmentprototyp.RecruitmentPrototyp.tikareadDoc(RecruitmentPrototyp.java:135)
    at recruitmentprototyp.RecruitmentPrototyp.doForAll(RecruitmentPrototyp.java:110)
    at recruitmentprototyp.RecruitmentPrototyp.main(RecruitmentPrototyp.java:897)
Caused by: java.lang.IllegalStateException: Pap style 19 claimed to have itself as its parent, which isn't allowed
    at org.apache.poi.hwpf.model.StyleSheet.createPap(StyleSheet.java:232)
    at org.apache.poi.hwpf.model.StyleSheet.<init>(StyleSheet.java:120)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:346)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:81)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
    ... 5 more
Java Result: 1
What should I do?