Question

I have the following Java code to parse the source of a web page:

URL url = new URL(urlToParse);
URLConnection con = url.openConnection();
InputStream is = con.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is));

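urlToParse is passed to this function as a parameter and equals "http://www.omegatiming.com/file/download/?id=00010F0200FFFFFFFFFFFFFFFFFFFF03".
The code, taken from here, outputs gibberish: it is full of question marks and unknown characters.

I tried adding these 5 lines after the openConnection() line: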

con.setRequestMethod("GET");
con.setDoOutput(true);
con.setReadTimeout(2000);
con.setChunkedStreamingMode(0);
con.connect();

This is the solution suggested here, but then I got this exception:

Exception in thread "main" java.io.FileNotFoundException: http://www.omegatiming.com/file/download/?id=00010F0200FFFFFFFFFFFFFFFFFFFF03
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1835)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)

thrown from the line InputStream is = con.getInputStream();

Copying this link into a browser takes me to the site, so the URL is presumably valid, yet calling con.getResponseCode() returns 404.

When I try to read the error from getErrorStream(), it prints:

<!DOCTYPE html>
<html>
    <head>
    <title>The resource cannot be found.</title>
    <meta name="viewport" content="width=device-width" />
    <style>
     body {font-family:"Verdana";font-weight:normal;font-size: .7em;color:black;} 
     p {font-family:"Verdana";font-weight:normal;color:black;margin-top: -5px}
     b {font-family:"Verdana";font-weight:bold;color:black;margin-top: -5px}
     H1 { font-family:"Verdana";font-weight:normal;font-size:18pt;color:red }
     H2 { font-family:"Verdana";font-weight:normal;font-size:14pt;color:maroon }
     pre {font-family:"Consolas","Lucida Console",Monospace;font-size:11pt;margin:0;padding:0.5em;line-height:14pt}
     .marker {font-weight: bold; color: black;text-decoration: none;}
     .version {color: gray;}
     .error {margin-bottom: 10px;}
     .expandable { text-decoration:underline; font-weight:bold; color:navy; cursor:hand; }
     @media screen and (max-width: 639px) {
      pre { width: 440px; overflow: auto; white-space: pre-wrap; word-wrap: break-word; }
     }
     @media screen and (max-width: 479px) {
      pre { width: 280px; }
     }
    </style>
</head>

<body bgcolor="white">

        <span><H1>Server Error in '/' Application.<hr width=100% size=1 color=silver></H1>

        <h2> <i>The resource cannot be found.</i> </h2></span>

        <font face="Arial, Helvetica, Geneva, SunSans-Regular, sans-serif ">

        <b> Description: </b>HTTP 404. The resource you are looking for (or one of its dependencies) could have been removed, had its name changed, or is temporarily unavailable. &nbsp;Please review the following URL and make sure that it is spelled correctly.
        <br><br>

        <b> Requested URL: </b>/file/download/<br><br>

        <hr width=100% size=1 color=silver>

        <b>Version Information:</b>&nbsp;Microsoft .NET Framework Version:4.0.30319; ASP.NET Version:4.0.30319.34248

        </font>

</body>  

 HttpException:  A public action method &#39;download&#39; was not found on controller     &#39;SwissTiming.DocMgmt.DMSWeb.Controllers.FileController&#39;.
at System.Web.Mvc.Controller.HandleUnknownAction(String actionName)
at System.Web.Mvc.Controller.<BeginExecuteCore>b__1d(IAsyncResult asyncResult, ExecuteCoreState innerState)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.Controller.EndExecuteCore(IAsyncResult asyncResult)
at System.Web.Mvc.Controller.<BeginExecute>b__15(IAsyncResult asyncResult, Controller controller)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.Controller.EndExecute(IAsyncResult asyncResult)
at System.Web.Mvc.Controller.System.Web.Mvc.Async.IAsyncController.EndExecute(IAsyncResult asyncResult)
at System.Web.Mvc.MvcHandler.<BeginProcessRequest>b__5(IAsyncResult asyncResult, ProcessRequestState innerState)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.MvcHandler.EndProcessRequest(IAsyncResult asyncResult)
at System.Web.Mvc.MvcHandler.System.Web.IHttpAsyncHandler.EndProcessRequest(IAsyncResult result)
at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
--><!-- 
This error page might contain sensitive information because ASP.NET is configured to show verbose error messages using &lt;customErrors mode="Off"/&gt;. Consider using &lt;customErrors mode="On"/&gt; or &lt;customErrors mode="RemoteOnly"/&gt; in production environments.-->

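This is basically where I am stuck, and I cannot make sense of the problem. I don't even know where ASP.NET comes into it.

Other workarounds I tried that did not solve the problem:
1. Adding
httpConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible)");
httpConnection.setRequestProperty("Accept", "*/*");
as suggested here. I also tried the user agent reported by this, as suggested here. I still get a FileNotFoundException in getInputStream().
2. Adding System.setProperty("http.agent", ""); as mentioned here.
3. Going back to the original problem (the gibberish output), I tried changing the call to InputStreamReader this way:
new InputStreamReader(new URL("www.website.com").openStream(), "UTF-8")
as mentioned in a comment here, but it changed nothing.
4. Adding the lines:
con.setRequestMethod("POST"); con.setDoInput(true);
I still get a FileNotFoundException.

I am baffled.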

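I am not even sure whether I have an encoding problem (before I tried to fix things by adding settings to the connection there was no exception, "just" wrong output) or some other problem that prevents me from reading the input. If it is the latter, what is special about this particular URL? The page that led me to it, for example http://www.omegatiming.com/Competition?id=00010F0200FFFFFFFFFFFFFFFFFFFFFF&sport=AQ&year=2015, can be parsed without any problem.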
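
One way to tell the two cases apart is to look at the response itself before decoding it as text. The following is only a minimal sketch, assuming the URL might be serving binary content (for example a PDF, which starts with the bytes "%PDF-") rather than badly decoded text. The class name ResponseProbe and its helper methods are hypothetical and are not part of the code above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: inspects status, headers and the first bytes of a response.
class ResponseProbe {

    /** Returns true if the given bytes start with the PDF signature "%PDF-". */
    static boolean looksLikePdf(byte[] head) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (head.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads up to n bytes; the result is shorter if the stream ends early. */
    static byte[] readHead(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int total = 0;
        while (total < n) {
            int read = in.read(buf, total, n - total);
            if (read < 0) {
                break;
            }
            total += read;
        }
        return Arrays.copyOf(buf, total);
    }

    /** Prints the status line details and a guess at the body type. */
    static void probe(String urlToParse) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(urlToParse).openConnection();
        con.setRequestMethod("GET");
        con.setReadTimeout(2000);
        try {
            int status = con.getResponseCode();
            System.out.println("Status:           " + status);
            System.out.println("Content-Type:     " + con.getContentType());
            System.out.println("Content-Encoding: " + con.getContentEncoding());

            // For 4xx/5xx responses getInputStream() throws, so the body
            // (if any) has to be read from getErrorStream(), which may be null.
            InputStream body = status >= 400 ? con.getErrorStream() : con.getInputStream();
            if (body == null) {
                System.out.println("No response body.");
                return;
            }
            try (InputStream in = body) {
                System.out.println("Looks like a PDF: " + looksLikePdf(readHead(in, 5)));
            }
        } finally {
            con.disconnect();
        }
    }
}
```

Calling ResponseProbe.probe(urlToParse) prints the status code, the declared content type and encoding, and whether the body begins with a PDF signature. A binary content type would suggest that wrapping the stream in a BufferedReader is the wrong tool, whatever charset is used.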

[here][1]: Using Java to pull data from a webpage?
[here][2]: Trying to read from a URL(in Java) produces gibberish on certain occaisions
[here][3]: URLConnection FileNotFoundException for non-standard HTTP port sources
[here][4]: Setting "User-Agent" parameters for URLConnection for querying Google from a Java application
[here][5]: Setting user agent of a java URLConnection
[here][6]: Trying to read from a URL(in Java) produces gibberish on certain occaisions

[this][1]: http://www.whatsmyuseragent.com/

Answer 1

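I managed to avoid having to parse the file directly from the web.

I got pdfbox by adding the dependency given here to my pom.xml, as described in this, and running mvn clean install.
I then downloaded the file to my computer, following the information in the post here, and (now that I had pdfbox) added these 3 lines: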

 PDDocument pdf = PDDocument.load(new File("sample.pdf"));
 PDFTextStripper stripper = new PDFTextStripper();
 String plainText = stripper.getText(pdf);

as mentioned at http://pdfbox.apache.org/2.0/getting-started.html.

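This is not a perfect solution. It uses storage on my PC to hold the file (it may be possible to keep only one file at a time and delete it after each use, but I have not checked that yet), and it may consume too much memory, because the program has to parse the whole file through the getText() method. Still, it solves my problem of how to parse this particular URL, where all that matters to my program is extracting the text.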
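
The "store one file and delete it each time" idea could look roughly like the sketch below. It is untested against the real site and uses only the standard library. The names TempDownload, PathProcessor and withTempCopy are hypothetical, and the text extraction itself is passed in as a callback so the sketch does not depend on pdfbox.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: copies a stream to a temporary file, runs a
// processing step on it, and always deletes the file afterwards.
class TempDownload {

    /** A processing step that may throw IOException (e.g. text extraction). */
    interface PathProcessor<T> {
        T apply(Path file) throws IOException;
    }

    static <T> T withTempCopy(InputStream in, PathProcessor<T> processor) throws IOException {
        Path tmp = Files.createTempFile("download-", ".pdf");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return processor.apply(tmp);
        } finally {
            // Runs whether or not processing succeeded, so at most one
            // temporary file exists at any time.
            Files.deleteIfExists(tmp);
        }
    }
}
```

With pdfbox on the classpath, the three lines above would go inside the callback, loading the temporary path's File instead of "sample.pdf" and closing the PDDocument before the callback returns. The input stream would come from the connection to the download URL.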

[here][1]: http://blog.e-zest.net/extracting-text-from-a-pdf-file/
[here][2]: How to download a PDF from a given URL in Java?

[this][1]: look

URLConnection parsing with BufferedReader prints gibberish; trying to fix it makes URLConnection.getInputStream throw FileNotFoundException
