Question

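Using the code snippet below, which uses Tika (the article object is my own), I came across a URL that redirects to a final page, I believe via a jQuery.extend command.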

URL articleURL = new URL(article.getLink());
stream = TikaInputStream.get(articleURL);
articleBytes = IOUtils.toByteArray(stream);
if (articleBytes.length == 0) {
    return null;
} else {
    article.setContentLength((long) articleBytes.length);
}

ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();

parser.parse(new ByteArrayInputStream(articleBytes), new BoilerpipeContentHandler(textHandler), metadata, context);

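Tika follows the redirect just fine, but I would like to know what the final URL is. Is there a way to get the actual final URL from Tika?

An example URL that involves a redirect is: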

http://sbs.feedsportal.com/c/34692/f/637529/s/4d7e2cd0/sc/14/l/0L0Ssbs0N0Bau0Cnews0Carticle0C20A160C0A20C110Cscientists0Emaking0Ezika0Edetection0Ekits/story01.htm--2016-02-27

Answer 1

Based on this answer: https://stackoverflow.com/a/5270162/4471711

I used the following code:

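// Open the connection myself instead of handing the URL straight to TikaInputStream
URLConnection con = new URL(article.getLink()).openConnection();
con.connect();
// Reading the response is what makes the connection follow any HTTP redirects
stream = TikaInputStream.get(con.getInputStream());
articleBytes = IOUtils.toByteArray(stream);
// After the response has been read, getURL() reports the URL the content was served from
article.setLink(con.getURL().toExternalForm());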

con.getURL().toExternalForm() returned the new (redirected) URL.
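
For anyone who wants to try this without Tika on the classpath, here is a minimal sketch of the same idea using only the JDK. The names RedirectResolver, Fetched and fetch are made up for illustration and are not part of Tika or of my code above. It assumes Java 9+ (for InputStream.readAllBytes) and that the redirect is an ordinary HTTP redirect; as far as I know the default HttpURLConnection does not follow redirects that switch between http and https, and it cannot follow redirects performed by JavaScript in the page.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper, not part of Tika: downloads a page and remembers
// which URL the bytes were finally served from.
final class RedirectResolver {

    // Simple holder for the result of one download.
    static final class Fetched {
        final String finalUrl; // URL after any redirects were followed
        final byte[] body;     // raw bytes of the response

        Fetched(String finalUrl, byte[] body) {
            this.finalUrl = finalUrl;
            this.body = body;
        }
    }

    static Fetched fetch(String link) throws IOException {
        URLConnection con = new URL(link).openConnection();
        // getInputStream() sends the request; for HTTP connections the JDK
        // follows redirects by default while doing so.
        try (InputStream in = con.getInputStream()) {
            byte[] body = in.readAllBytes();
            // Ask for the URL only after the response has been read,
            // otherwise it is still the URL we started with.
            return new Fetched(con.getURL().toExternalForm(), body);
        }
    }
}
```

The returned bytes can then be wrapped in a ByteArrayInputStream and passed to parser.parse(...) exactly as in the question, and finalUrl can be stored with article.setLink(...).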

Apache Tika - How to access the redirect URL
