Question

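I am trying to get the content of documents using the Apache tika() function. I can get the content of .doc and .docx files, but it does not work for .pdf files. I do not specify the document type anywhere in my code, so I do not understand why it fails for .pdf files.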

Here is my code:

In the extractDocument function:

    int indexedChars = -1;
    Metadata metadata = new Metadata();
    int experiance=0;
    String parsedContent;
    parsedContent = tika().parseToString(new BytesStreamInput(
            Base64.decode(document.getContent().getBytes()), false), metadata, indexedChars);
    System.out.println("parsedContent "+parsedContent);

Here parsedContent comes back as an empty string. This is the function I call it from:

    public Document push(Document document, String userName, HttpServletRequest req) {

        if (logger.isDebugEnabled()) logger.debug("push({})", document.getContent());
        if (document == null)
            return null;
        System.out.println("document.getContent() is " + document.getContent());

        /*
        if (document.getIndex() == null || document.getIndex().isEmpty()) {
            document.setIndex(SMDSearchProperties.INDEX_NAME);
        }
        if (document.getType() == null || document.getType().isEmpty()) {
            document.setType(SMDSearchProperties.INDEX_TYPE_DOC);
        }
         */
        getNodeClient(userName);
        try {

            System.out.println("client is " + userName);
            IndexResponse response = client
                    .prepareIndex(userName, document.getType(),
                            document.getId())
                    .setSource(extractDocument(document)).execute()
                    .actionGet();
            document.setId(response.getId());
        } catch (Exception e) {
            e.printStackTrace();
            logger.warn("Can not index document {}", document.getName());
            System.out.println("Can not index document {}" + document.getName() + " e.getMessage() " + e.getMessage());
            //throw new RestAPIException("Can not index document : "+ document.getName() + ": "+e.getMessage());
        }
        if (logger.isDebugEnabled()) logger.debug("/push()={}", document);
        return document;
    }

Answer 1

I got the solution from here:

Error while parsing Binary Files... (mostly PDF)

Download these 3 jar files, copy them into the lib folder, and add them to the project.
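To confirm whether the jars are actually visible to the application after copying them, a small classpath check can help. The sketch below is only an illustration: the class name ClasspathCheck and its methods are hypothetical, and the list of classes to probe is an assumption. The answer does not name the three jars; since Tika's PDF parser is built on Apache PDFBox, the example probes for a PDFBox class as a likely candidate. One plausible reason for the empty string is that Tika skips a parser whose dependencies cannot be loaded, so a missing class here would be consistent with the symptom.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: reports which classes cannot be loaded.
public class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the subset of the given class names that could not be loaded.
    static List<String> missing(List<String> classNames) {
        List<String> result = new ArrayList<>();
        for (String name : classNames) {
            if (!isOnClasspath(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: these are the classes worth probing; adjust to the jars you added.
        List<String> required = Arrays.asList(
                "org.apache.tika.Tika",                  // Tika facade (tika-core)
                "org.apache.pdfbox.pdmodel.PDDocument"); // PDFBox, used by Tika's PDF parser
        List<String> notFound = missing(required);
        if (notFound.isEmpty()) {
            System.out.println("All probed classes are on the classpath.");
        } else {
            System.out.println("Missing from classpath: " + notFound);
        }
    }
}
```

Run it with the same classpath as the application (for a web application, the check would need to run inside the deployed app to see the same lib folder). If a class is reported missing, the corresponding jar was not picked up.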

