Extract the text from a URL using Tika

Time: 2011-07-11 21:30:23

Tags: java apache-tika

Is it possible to extract text from a URL using Tika? Any links would be appreciated. Or is Tika only usable for PDF, Word, and other media files?

4 answers:

Answer 0: (score: 7)

Check the documentation: yes, you can.

Example:

java -jar tika-app-0.9.jar -t http://stackoverflow.com/questions/6656849/extract-the-text-from-url-using-tika

This prints the text of this page.
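
The same extraction can be done from Java rather than the command line. Here is a minimal sketch, assuming tika-core and tika-parsers are on the classpath; the class name `UrlTextExtractor` and its `extract` method are made up for illustration. It relies on the `Tika` facade's `parseToString(URL)`, which fetches the resource, detects its type, and returns the plain text. Note that the facade caps the length of the returned string by default (`setMaxStringLength` adjusts that).

```java
import java.io.IOException;
import java.net.URI;
import java.net.URL;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class; not part of Tika itself.
public class UrlTextExtractor {

    // Fetches the resource behind the URL and returns its text content.
    public static String extract(URL url) throws IOException, TikaException {
        Tika tika = new Tika();
        return tika.parseToString(url);
    }

    public static void main(String[] args) throws Exception {
        // Falls back to the URL from the command-line example when no argument is given.
        String address = args.length > 0
                ? args[0]
                : "http://stackoverflow.com/questions/6656849/extract-the-text-from-url-using-tika";
        System.out.println(extract(URI.create(address).toURL()));
    }
}
```

Any URL scheme the JDK can open should work here, so a `file:` URL is a convenient way to try it without network access.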

Answer 1: (score: 6)

Here is a lucid example:

// resourceLocation is the path to a local PDF file.
InputStream input = new FileInputStream(new File(resourceLocation));
try {
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    PDFParser parser = new PDFParser();
    // The handler receives the body text; the parser fills in the metadata.
    parser.parse(input, textHandler, metadata, new ParseContext());
    System.out.println("Title: " + metadata.get("title"));
    System.out.println("Author: " + metadata.get("Author"));
    System.out.println("content: " + textHandler.toString());
} finally {
    input.close();
}

Instead of creating a PDFParser, you can use Tika's AutoDetectParser to handle different types of files automatically:

Parser parser = new AutoDetectParser();

Answer 2: (score: 3)

Yes, you can do that. Here is the code. It uses Apache HttpClient:

// Fetch the page with Apache HttpClient.
HttpClient client = new DefaultHttpClient();
HttpGet httpget = new HttpGet("http://url.here");
HttpResponse response = client.execute(httpget);
HttpEntity entity = response.getEntity();
if (entity != null) {
    InputStream instream = entity.getContent();
    try {
        // Let Tika detect the content type and extract the body text.
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        Parser parser = new AutoDetectParser();
        parser.parse(instream, handler, metadata, new ParseContext());
        String plainText = handler.toString();

        // Save the extracted text to a file.
        FileWriter writer = new FileWriter("/scratch/cache/output.txt");
        writer.write(plainText);
        writer.close();
        System.out.println("done");
    } finally {
        instream.close();
    }
}
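
If adding Apache HttpClient as a dependency is not desirable, the JDK's own `URL.openStream()` can supply the input stream instead. The following is only a sketch under a few assumptions: the Tika parsers are on the classpath, Java 7 or later is available for try-with-resources, and the class name `StreamTextExtractor` is hypothetical. Passing `-1` to `BodyContentHandler` disables its write limit, which otherwise stops the handler once a long page exceeds the default number of characters.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class; not part of Tika itself.
public class StreamTextExtractor {

    // Opens the URL with the JDK and lets AutoDetectParser pick the right parser.
    public static String extract(URL url)
            throws IOException, SAXException, TikaException {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = url.openStream()) {
            new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
        }
        return handler.toString();
    }
}
```

The trade-off is that `URL.openStream()` gives no control over timeouts, headers, or cookies, so HttpClient remains the better fit when the request itself needs tuning.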

Answer 3: (score: 1)

To extract content from a URL rather than a local file, use this code:

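    // `content` and `LOG` come from surrounding code that this answer does not show;
    // content.getContent() is expected to return the bytes already fetched from the URL.
    byte[] raw = content.getContent();
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    // Wrap the bytes in a stream so Tika can detect the type and parse them.
    parser.parse(new ByteArrayInputStream(raw), handler, metadata, new ParseContext());
    LOG.info("content: " + handler.toString());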