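I'm building a piece of software that should retrieve a page's title from a supplied URL, and I'm trying to use JSoup for this. The links are mostly from YouTube, and JSoup works perfectly with them, but occasionally the input is a PDF, for example http://www.ninsheetmusic.org/download/pdf/2066, in which case I get the following exception: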
org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml. Mimetype=application/pdf, URL=http://www.ninsheetmusic.org/download/pdf/2066
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:689)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:628)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:260)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:249)
at core.Request.parseTitle(Request.java:54)
at core.Request.<init>(Request.java:29)
at core.GrakeBot.parseRequest(GrakeBot.java:161)
at core.GrakeBot.onMessage(GrakeBot.java:59)
at org.jibble.pircbot.PircBot.handleLine(PircBot.java:990)
at org.jibble.pircbot.InputThread.run(InputThread.java:92)
I gather that JSoup doesn't handle PDFs, but what can I do here to avoid this exception and still get the title?
This is the code I'm currently using:
private String parseTitle(String link)
{
    Document doc = null;
    String title = "Title could not be retrieved";
    if (getType() == RequestType.YOUTUBE)
    {
        try
        {
            doc = Jsoup.connect(getLink()).get();
            title = doc.getElementById("eow-title").text();
        } catch (IOException e)
        {
            e.printStackTrace();
        }
        return title;
    }
    else if (getType() == RequestType.SHEET)
    {
        try
        {
            doc = Jsoup.connect(getLink()).get();
            title = doc.getElementsByTag("title").text();
        } catch (IOException e)
        {
            e.printStackTrace();
        }
        return title;
    }
    else
        return title;
}
Answer 0 (score: 0)
You can't use JSoup for this. It isn't HTML. I looked at the source, and the data you receive looks like this:
%PDF-1.5
%Çì¢
5 0 obj
<</Length 6 0 R/Filter /FlateDecode>>
stream
xœÍ\]\[³$GqÆö1áë˜à…GœQUf\]ý† Làp8¿¬xÀ+åÐ
„1?ÉÿÒy©KöîYítŸÌîÙ¯«ºº.™YyëêßœÜÕÃÉñÿþã嫇÷œO¿üâ!~÷à®Îa©%Qá“?üƒ‡„pŠ Ã5åÓ+E.#£O~ò s5ýñù/~óP®Èÿä‚ýýòÕé? Ç—“Ï×RNü‚:Pk
åþTü5”SJpÍpúàÕÃùk—þë!Ôk§þ…áwèÞœ¨·ñÑÃùëRéZJ8=&¹·_ŽW úS}óÒ\[j¾årÙ<í-•‘&3õ"8éóªwïnÀM-\]¡ä=¾ìÚbéýMô¦èû<üè&rñÙYzQh\]ª¦\¡)ÙÙr…¦<Ù&t›*~>‘«Hq¬Ù78‹cÝ+œÅ›b…£8”h‹Åh‹Å¾$\[Üà,eS¬p{¿)V8Š\]ÞLKƒÂªïZ=bÞäef\[÷£Ï4¸$ÏO1Òo”YŒDG£´üò!æ:Ч)&ªçåžÈ÷D'“£-ƒ×{~ñ÷ú¨Ñ\[y×ôGAö7=
...
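The leading %PDF- marker in that dump is a cheap way to recognise such a response before handing it to an HTML parser. Below is a minimal sketch in plain Java, assuming you already have the first few bytes of the response body in hand; the class and method names (PdfSniffer, looksLikePdf) are made up for illustration and are not part of JSoup or PDFBox.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the names here are illustrative, not from any library.
class PdfSniffer {
    // PDF files begin with the ASCII marker "%PDF-", followed by a version such as 1.5.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes start with the PDF marker. */
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.5\n".getBytes(StandardCharsets.US_ASCII);
        byte[] html = "<!DOCTYPE html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdf));
        System.out.println(looksLikePdf(html));
    }
}
```

This only tells you what kind of data you have; getting the title out of it still needs a PDF library.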
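You have to parse it with a PDF library. After looking around, Apache PDFBox seems to be what you want. This code is adapted from the documentation and is untested, but it looks like what you're trying to do: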
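// Assumes the PDFBox 1.x/2.x API; the URL is opened as a stream because load(String) in 1.x expects a local file path.
PDDocument doc = PDDocument.load(new URL("http://www.ninsheetmusic.org/download/pdf/2066").openStream());
PDDocumentInformation info = doc.getDocumentInformation();
String title = info.getTitle(); // may be null if the PDF carries no title metadata
doc.close();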
The only thing left is to install Apache PDFBox :)
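To keep parseTitle from throwing on links like this, one option is to look at the response's Content-Type first and only hand HTML-like responses to JSoup. With JSoup the content type can be read via Jsoup.connect(url).ignoreContentType(true).execute().contentType(). The sketch below is plain Java and covers only the decision step; the names (TitleSource, chooseParser, Parser) are hypothetical, and the accepted types follow the wording of the exception message above rather than JSoup's internal check.

```java
import java.util.Locale;

// Hypothetical helper: decides which parser to try for a given Content-Type header.
class TitleSource {
    enum Parser { HTML, PDF, NONE }

    static Parser chooseParser(String contentType) {
        if (contentType == null) {
            return Parser.NONE;
        }
        // Drop parameters such as "; charset=utf-8" and normalise case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        // Types listed in the exception message: text/*, application/xml, application/xhtml+xml.
        if (mime.startsWith("text/")
                || mime.equals("application/xml")
                || mime.equals("application/xhtml+xml")) {
            return Parser.HTML;
        }
        if (mime.equals("application/pdf")) {
            return Parser.PDF;
        }
        return Parser.NONE;
    }

    public static void main(String[] args) {
        System.out.println(chooseParser("text/html; charset=utf-8"));
        System.out.println(chooseParser("application/pdf"));
        System.out.println(chooseParser("image/png"));
    }
}
```

In parseTitle you would then branch on the result: use JSoup for HTML, PDFBox for PDF, and fall back to the default "Title could not be retrieved" otherwise.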