Parsing a URLConnection with BufferedReader prints gibberish; trying to fix it makes URLConnection.getInputStream throw FileNotFoundException

Asked: 2015-11-02 11:25:04

Tags: asp.net-mvc character-encoding html-parsing inputstream filenotfoundexception

I have the following Java code to parse a website's source:

URL url = new URL(urlToParse);
URLConnection con = url.openConnection();
InputStream is = con.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is));

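urlToParse is passed to this function as a parameter and equals "http://www.omegatiming.com/file/download/?id=00010F0200FFFFFFFFFFFFFFFFFFFF03".
The code, taken from here [1], outputs gibberish: lots of question marks and unknown characters.

I tried adding these 5 lines after the openConnection() line.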

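// Note: setRequestMethod and setChunkedStreamingMode are defined on HttpURLConnection,
// so these lines assume con has been cast to HttpURLConnection.
con.setRequestMethod("GET");
con.setDoOutput(true);
con.setReadTimeout(2000);
con.setChunkedStreamingMode(0);
con.connect();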
This is the solution offered here [2],

but then I got this exception:
Exception in thread "main" java.io.FileNotFoundException: http://www.omegatiming.com/file/download/?id=00010F0200FFFFFFFFFFFFFFFFFFFF03
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1835)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)
thrown from the line InputStream is = con.getInputStream();

Copying this link into a browser takes me to the site, so the URL appears to be valid, yet calling con.getResponseCode() returns 404.

When I read the error from getErrorStream(), it prints:

<!DOCTYPE html>
<html>
    <head>
    <title>The resource cannot be found.</title>
    <meta name="viewport" content="width=device-width" />
    <style>
     body {font-family:"Verdana";font-weight:normal;font-size: .7em;color:black;} 
     p {font-family:"Verdana";font-weight:normal;color:black;margin-top: -5px}
     b {font-family:"Verdana";font-weight:bold;color:black;margin-top: -5px}
     H1 { font-family:"Verdana";font-weight:normal;font-size:18pt;color:red }
     H2 { font-family:"Verdana";font-weight:normal;font-size:14pt;color:maroon }
     pre {font-family:"Consolas","Lucida Console",Monospace;font-size:11pt;margin:0;padding:0.5em;line-height:14pt}
     .marker {font-weight: bold; color: black;text-decoration: none;}
     .version {color: gray;}
     .error {margin-bottom: 10px;}
     .expandable { text-decoration:underline; font-weight:bold; color:navy; cursor:hand; }
     @media screen and (max-width: 639px) {
      pre { width: 440px; overflow: auto; white-space: pre-wrap; word-wrap: break-word; }
     }
     @media screen and (max-width: 479px) {
      pre { width: 280px; }
     }
    </style>
</head>

<body bgcolor="white">

        <span><H1>Server Error in '/' Application.<hr width=100% size=1 color=silver></H1>

        <h2> <i>The resource cannot be found.</i> </h2></span>

        <font face="Arial, Helvetica, Geneva, SunSans-Regular, sans-serif ">

        <b> Description: </b>HTTP 404. The resource you are looking for (or one of its dependencies) could have been removed, had its name changed, or is temporarily unavailable. &nbsp;Please review the following URL and make sure that it is spelled correctly.
        <br><br>

        <b> Requested URL: </b>/file/download/<br><br>

        <hr width=100% size=1 color=silver>

        <b>Version Information:</b>&nbsp;Microsoft .NET Framework Version:4.0.30319; ASP.NET Version:4.0.30319.34248

        </font>

</body>  

<!--
 HttpException:  A public action method &#39;download&#39; was not found on controller     &#39;SwissTiming.DocMgmt.DMSWeb.Controllers.FileController&#39;.
at System.Web.Mvc.Controller.HandleUnknownAction(String actionName)
at System.Web.Mvc.Controller.<BeginExecuteCore>b__1d(IAsyncResult asyncResult, ExecuteCoreState innerState)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.Controller.EndExecuteCore(IAsyncResult asyncResult)
at System.Web.Mvc.Controller.<BeginExecute>b__15(IAsyncResult asyncResult, Controller controller)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.Controller.EndExecute(IAsyncResult asyncResult)
at System.Web.Mvc.Controller.System.Web.Mvc.Async.IAsyncController.EndExecute(IAsyncResult asyncResult)
at System.Web.Mvc.MvcHandler.<BeginProcessRequest>b__5(IAsyncResult asyncResult, ProcessRequestState innerState)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.MvcHandler.EndProcessRequest(IAsyncResult asyncResult)
at System.Web.Mvc.MvcHandler.System.Web.IHttpAsyncHandler.EndProcessRequest(IAsyncResult result)
at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
--><!-- 
This error page might contain sensitive information because ASP.NET is configured to show verbose error messages using &lt;customErrors mode="Off"/&gt;. Consider using &lt;customErrors mode="On"/&gt; or &lt;customErrors mode="RemoteOnly"/&gt; in production environments.-->  
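
For reference, here is a minimal sketch of how the status code and the error body above can be read. HttpURLConnection.getInputStream() throws FileNotFoundException when the server answers 404, so the body has to come from getErrorStream() instead. The class name ErrorStreamDemo, the helpers fetch, readAll and demo, and the throwaway local server that always answers 404 are assumptions made for illustration only; the real request went to omegatiming.com.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class ErrorStreamDemo {

    /** Returns "<status> <body>", reading the error stream for 4xx/5xx responses. */
    static String fetch(String address) {
        try {
            HttpURLConnection con = (HttpURLConnection) new URL(address).openConnection();
            con.setRequestMethod("GET");
            con.setConnectTimeout(2000);
            con.setReadTimeout(2000);
            int status = con.getResponseCode();
            // getInputStream() would throw FileNotFoundException for a 404,
            // so use the error stream when the status signals a failure.
            InputStream body = status >= 400 ? con.getErrorStream() : con.getInputStream();
            String text = body == null ? "" : new String(readAll(body), StandardCharsets.UTF_8);
            con.disconnect();
            return status + " " + text;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole stream into a byte array and closes it. */
    static byte[] readAll(InputStream in) throws IOException {
        try (InputStream src = in; ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = src.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Hypothetical stand-in for the real site: a local server that always answers 404. */
    static String demo() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                byte[] msg = "The resource cannot be found.".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(404, msg.length);
                exchange.getResponseBody().write(msg);
                exchange.close();
            });
            server.start();
            try {
                int port = server.getAddress().getPort();
                return fetch("http://127.0.0.1:" + port + "/file/download/");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Against the real site, fetch(urlToParse) would take the place of demo(); the local server is only there so the sketch runs without network access.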

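This is basically where I am stuck; I cannot make sense of the problem at all. I do not even know where ASP.NET comes into it.

Other workarounds I tried that did not solve the problem:
1. Adding
httpConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible)");
httpConnection.setRequestProperty("Accept", "*/*");
as suggested here [3]. I also tried using the user agent reported by this [1], as suggested here [4]. I still get FileNotFoundException from getInputStream().
2. Adding System.setProperty("http.agent", ""); as described here [5].
3. Going back to the original problem (the gibberish output), I tried changing the InputStreamReader call to
new InputStreamReader(new URL("www.website.com").openStream(), "UTF-8")
as mentioned in a comment here [6], but it changed nothing.
4. Adding the lines
con.setRequestMethod("POST"); con.setDoInput(true);
I still get FileNotFoundException.

I am very confused.

I am not even sure whether I have an encoding problem (before I tried to fix it by adding things to the connection there was no exception, "just" wrong output) or some other problem that prevents me from reading the input at all. If it is the latter, what is special about this particular URL? The page that led me to it, for example http://www.omegatiming.com/Competition?id=00010F0200FFFFFFFFFFFFFFFFFFFFFF&sport=AQ&year=2015, can be parsed without any problem.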

[here][1]: Using Java to pull data from a webpage?
[here][2]: Trying to read from a URL(in Java) produces gibberish on certain occaisions
[here][3]: URLConnection FileNotFoundException for non-standard HTTP port sources
[here][4]: Setting "User-Agent" parameters for URLConnection for querying Google from a Java application
[here][5]: Setting user agent of a java URLConnection
[here][6]: Trying to read from a URL(in Java) produces gibberish on certain occaisions

[this][1]: http://www.whatsmyuseragent.com/

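1 answer:

Answer 0 (score: 0)

I managed to avoid having to parse the file directly from the web.

I got pdfbox by adding the dependency written here [1] (see this [1]) to my pom.xml and running mvn clean install.
Then I downloaded the file to my PC, following the post here [2]. Then (now that I had pdfbox) I added these 3 lines: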

PDDocument pdf = PDDocument.load(new File("sample.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
String plainText = stripper.getText(pdf);

as mentioned at http://pdfbox.apache.org/2.0/getting-started.html
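
The gibberish in the question is consistent with a binary PDF being decoded as text, which would also explain why PDFBox can read the downloaded file. A small sketch of how to check for that before wrapping a stream in an InputStreamReader: PDF files begin with the signature "%PDF-", so peeking at the first five bytes is enough. The class name PdfSniffer and both method names are hypothetical, and that this URL actually serves a PDF is an assumption based on this answer.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class PdfSniffer {
    // Every PDF file starts with these five ASCII bytes.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** True if the given leading bytes start with the PDF signature. */
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Peeks at the start of a buffered stream without consuming any bytes. */
    static boolean peekIsPdf(BufferedInputStream in) throws IOException {
        in.mark(MAGIC.length);
        byte[] head = new byte[MAGIC.length];
        int total = 0;
        while (total < head.length) {
            int n = in.read(head, total, head.length - total);
            if (n == -1) {
                break; // stream ended before five bytes were available
            }
            total += n;
        }
        in.reset(); // rewind so the caller still sees the full content
        return total == head.length && looksLikePdf(head);
    }

    public static void main(String[] args) throws IOException {
        byte[] pdfStart = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlStart = "<!DOCTYPE html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdfStart));
        System.out.println(looksLikePdf(htmlStart));
        System.out.println(peekIsPdf(new BufferedInputStream(new ByteArrayInputStream(pdfStart))));
    }
}
```

With a real connection, wrapping con.getInputStream() in a BufferedInputStream and calling peekIsPdf on it would tell you whether to hand the stream to a PDF parser or to a text reader.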

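This is not a perfect solution. It uses storage on my PC to keep the file (it may be possible to keep only one file and delete it each time, but I have not checked that yet), and it may use too much memory, because the program has to parse the entire file through the getText() method. But it solves my problem of how to parse this particular site, since all that matters to my program is extracting the text from it.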
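
The "keep one file and delete it each time" idea mentioned above could be sketched roughly like this, using only the standard library. The class name TempDownload and the method withTempFile are hypothetical; the sketch copies any InputStream to a temporary file, passes the file to a caller-supplied action, and deletes it afterwards.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Function;

class TempDownload {

    /**
     * Copies the stream to a temporary file, hands the file to the action,
     * and deletes the file afterwards, even if the action throws.
     */
    static <T> T withTempFile(InputStream source, Function<Path, T> action) {
        Path tmp = null;
        try {
            tmp = Files.createTempFile("download-", ".pdf");
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
            return action.apply(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (tmp != null) {
                try {
                    Files.deleteIfExists(tmp);
                } catch (IOException ignored) {
                    // Best effort: a leftover temp file is not fatal here.
                }
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for url.openStream(): three bytes of fake content.
        InputStream fake = new ByteArrayInputStream(new byte[] {1, 2, 3});
        long size = withTempFile(fake, path -> path.toFile().length());
        System.out.println(size);
    }
}
```

In the setup described in this answer, the source would be the stream opened from the URL and the action would call PDDocument.load(path.toFile()) followed by the PDFTextStripper lines. PDFBox 2.x also has a PDDocument.load(InputStream) overload, which may make the temporary file unnecessary, though I have not tried it against this site.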

[here][1]: http://blog.e-zest.net/extracting-text-from-a-pdf-file/
[here][2]: How to download a PDF from a given URL in Java?

[this][1]: look