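I have an index in Elasticsearch that stores URLs. I need to use Apache Tika to fetch each URL so that, whenever I run my Java application, it gives me the web page the URL points to.
I tried the following code, but what I get is only the plain text of the page at that URL, written to a file with an .html extension: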
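import java.io.FileWriter;
import java.io.InputStream;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// Fetch the page over HTTP ("url" stands for the value read from the index).
HttpGet httpget = new HttpGet("url");
HttpClient client = new DefaultHttpClient();
HttpResponse response = client.execute(httpget);
HttpEntity entity = response.getEntity();
if (entity != null) {
    InputStream instream = entity.getContent();
    // BodyContentHandler collects only the text of the body, without any markup.
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    parser.parse(instream, handler, metadata, new ParseContext());
    String plainText = handler.toString();
    // The file is named .html, but its content is the extracted plain text.
    FileWriter writer = new FileWriter("./tessdata/output.html");
    writer.write(plainText);
    writer.close();
    System.out.println("done");
}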
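I want the exact web page to be displayed when I run the Java application. For example, if I pass google.com as the URL, running the application should take me to the Google page.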
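To make the desired behavior concrete, here is a minimal sketch of what "take me to the page" could look like. It uses only the JDK's `java.awt.Desktop` API to hand the URL to the default browser; it does not involve Tika at all. The class name `PageOpener`, the helper `toUri`, and the assumption that a bare host such as google.com should be opened over https are all hypothetical and only for illustration.

```java
import java.awt.Desktop;
import java.awt.GraphicsEnvironment;
import java.io.IOException;
import java.net.URI;

class PageOpener {

    // Hypothetical helper: prepend "https://" when the stored value
    // is a bare host such as "google.com" (an assumption, not from the post).
    static URI toUri(String raw) {
        String trimmed = raw.trim();
        if (!trimmed.matches("(?i)^https?://.*")) {
            trimmed = "https://" + trimmed;
        }
        return URI.create(trimmed);
    }

    // Asks the default desktop browser to show the page.
    // Returns false when no browser can be launched (for example on a headless server).
    static boolean open(URI uri) {
        if (GraphicsEnvironment.isHeadless() || !Desktop.isDesktopSupported()) {
            return false;
        }
        Desktop desktop = Desktop.getDesktop();
        if (!desktop.isSupported(Desktop.Action.BROWSE)) {
            return false;
        }
        try {
            desktop.browse(uri);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The URL would normally come from the Elasticsearch index;
        // here it is taken from the command line, defaulting to the example above.
        URI uri = toUri(args.length > 0 ? args[0] : "google.com");
        if (open(uri)) {
            System.out.println("Opened " + uri);
        } else {
            System.out.println("No desktop browser available for " + uri);
        }
    }
}
```

Running `java PageOpener.java google.com` on a desktop machine should open the page in the default browser. If the page has to be shown inside the application itself rather than in an external browser, an embedded browser component would be needed instead, which this sketch does not cover.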