I have the following Java code to parse a website's source:
URL url = new URL(urlToParse);
URLConnection con = url.openConnection();
InputStream is = con.getInputStream();
BufferedReader br = new BufferedReader(new InputStreamReader(is));
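urlToParse is passed to this function as a parameter and equals "http://www.omegatiming.com/file/download/?id=00010F0200FFFFFFFFFFFFFFFFFFFF03".
The code is taken from here.
The output is gibberish, full of question marks and unknown characters.
I tried adding these 5 lines after the openConnection() line: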
con.setRequestMethod("GET");
con.setDoOutput(true);
con.setReadTimeout(2000);
con.setChunkedStreamingMode(0);
con.connect();
which is the solution suggested here, but then I got this exception:
Exception in thread "main" java.io.FileNotFoundException: http://www.omegatiming.com/file/download/?id=00010F0200FFFFFFFFFFFFFFFFFFFF03
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1835)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)
thrown from the line InputStream is = con.getInputStream();
Pasting this link into a browser takes me to the site, so the URL itself appears to be valid, yet calling con.getResponseCode() returns 404.
When I read the error from getErrorStream(), it prints:
<!DOCTYPE html>
<html>
<head>
<title>The resource cannot be found.</title>
<meta name="viewport" content="width=device-width" />
<style>
body {font-family:"Verdana";font-weight:normal;font-size: .7em;color:black;}
p {font-family:"Verdana";font-weight:normal;color:black;margin-top: -5px}
b {font-family:"Verdana";font-weight:bold;color:black;margin-top: -5px}
H1 { font-family:"Verdana";font-weight:normal;font-size:18pt;color:red }
H2 { font-family:"Verdana";font-weight:normal;font-size:14pt;color:maroon }
pre {font-family:"Consolas","Lucida Console",Monospace;font-size:11pt;margin:0;padding:0.5em;line-height:14pt}
.marker {font-weight: bold; color: black;text-decoration: none;}
.version {color: gray;}
.error {margin-bottom: 10px;}
.expandable { text-decoration:underline; font-weight:bold; color:navy; cursor:hand; }
@media screen and (max-width: 639px) {
pre { width: 440px; overflow: auto; white-space: pre-wrap; word-wrap: break-word; }
}
@media screen and (max-width: 479px) {
pre { width: 280px; }
}
</style>
</head>
<body bgcolor="white">
<span><H1>Server Error in '/' Application.<hr width=100% size=1 color=silver></H1>
<h2> <i>The resource cannot be found.</i> </h2></span>
<font face="Arial, Helvetica, Geneva, SunSans-Regular, sans-serif ">
<b> Description: </b>HTTP 404. The resource you are looking for (or one of its dependencies) could have been removed, had its name changed, or is temporarily unavailable. Please review the following URL and make sure that it is spelled correctly.
<br><br>
<b> Requested URL: </b>/file/download/<br><br>
<hr width=100% size=1 color=silver>
<b>Version Information:</b> Microsoft .NET Framework Version:4.0.30319; ASP.NET Version:4.0.30319.34248
</font>
</body>
HttpException: A public action method 'download' was not found on controller 'SwissTiming.DocMgmt.DMSWeb.Controllers.FileController'.
at System.Web.Mvc.Controller.HandleUnknownAction(String actionName)
at System.Web.Mvc.Controller.<BeginExecuteCore>b__1d(IAsyncResult asyncResult, ExecuteCoreState innerState)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.Controller.EndExecuteCore(IAsyncResult asyncResult)
at System.Web.Mvc.Controller.<BeginExecute>b__15(IAsyncResult asyncResult, Controller controller)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.Controller.EndExecute(IAsyncResult asyncResult)
at System.Web.Mvc.Controller.System.Web.Mvc.Async.IAsyncController.EndExecute(IAsyncResult asyncResult)
at System.Web.Mvc.MvcHandler.<BeginProcessRequest>b__5(IAsyncResult asyncResult, ProcessRequestState innerState)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallEndDelegate(IAsyncResult asyncResult)
at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.End()
at System.Web.Mvc.MvcHandler.EndProcessRequest(IAsyncResult asyncResult)
at System.Web.Mvc.MvcHandler.System.Web.IHttpAsyncHandler.EndProcessRequest(IAsyncResult result)
at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
--><!--
This error page might contain sensitive information because ASP.NET is configured to show verbose error messages using <customErrors mode="Off"/>. Consider using <customErrors mode="On"/> or <customErrors mode="RemoteOnly"/> in production environments.-->
This is basically where I got stuck; I can't make sense of the problem at all. I don't even know where ASP.NET comes into it.
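For reference, HttpURLConnection.getInputStream() throws FileNotFoundException when the server answers 404, so the exception above is the 404 itself surfacing rather than a separate failure. The following is only a minimal sketch of how the status code and the error body can be inspected together with a plain GET. The class name HttpProbe, its method names, and the 5-second timeouts are hypothetical choices for illustration, and decoding the body as UTF-8 is an assumption about the server.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reports the status line details for one GET request.
final class HttpProbe {

    /** Reads a stream to the end and decodes it as UTF-8 (an assumption). */
    static String readBody(InputStream in) throws IOException {
        if (in == null) {
            return ""; // getErrorStream() returns null when there is no error body
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** True for 4xx and 5xx codes, where getInputStream() would throw. */
    static boolean isError(int status) {
        return status >= 400;
    }

    /** Sends a plain GET (no request body) and summarizes the response. */
    static String describe(String address) throws IOException {
        HttpURLConnection con =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        con.setRequestMethod("GET");
        con.setConnectTimeout(5000); // assumed timeout, in milliseconds
        con.setReadTimeout(5000);    // assumed timeout, in milliseconds
        try {
            int status = con.getResponseCode();
            // Pick the stream that matches the status instead of letting
            // getInputStream() throw on an error response.
            try (InputStream in = isError(status)
                    ? con.getErrorStream()
                    : con.getInputStream()) {
                String body = readBody(in);
                return status + " " + con.getContentType()
                        + " (" + body.length() + " chars)";
            }
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java HttpProbe <url>");
            return;
        }
        System.out.println(describe(args[0]));
    }
}
```

Running it once against urlToParse with and once without the extra connection settings would show whether those settings are what changes the server's response.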
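Other things I tried that did not solve the problem:
1. Adding
httpConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible)");
httpConnection.setRequestProperty("Accept", "*/*");
as suggested here. I also tried the userAgent reported by this, as suggested here.
I still get a FileNotFoundException in getInputStream().
2. Adding
System.setProperty("http.agent", "");
as mentioned here.
3. Going back to the original problem (the gibberish output), I tried changing the InputStreamReader call like this:
new InputStreamReader(new URL("www.website.com").openStream(), "UTF-8") as mentioned in a comment here, but it changed nothing.
4. Adding the lines:
con.setRequestMethod("POST");
con.setDoInput(true);
I still get a FileNotFoundException.
I'm quite confused.
I'm not even sure whether I have an encoding problem (before I tried to fix things by adding settings to the connection there was no exception, "just" bad output) or some other problem that keeps me from getting the input at all. If it is the latter, what is special about this particular site? The page that led me to it, e.g. http://www.omegatiming.com/Competition?id=00010F0200FFFFFFFFFFFFFFFFFFFFFF&sport=AQ&year=2015, can be parsed without any problem.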
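One way to tell an encoding problem apart from a response that is not text at all is to look at the Content-Type header and the first few bytes of the body before wrapping the stream in a reader. Below is a minimal sketch of that check; the class name ContentSniffer and its methods are hypothetical, the list of "textual" content types is a rough assumption, and the only format it recognizes is PDF, whose files normally begin with the signature "%PDF-".

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper for deciding whether a response should be read as text.
final class ContentSniffer {

    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data starts with the PDF signature "%PDF-". */
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Rough guess (an assumption) at whether a Content-Type value is text. */
    static boolean isTextual(String contentType) {
        if (contentType == null) {
            return false;
        }
        String ct = contentType.toLowerCase(Locale.ROOT);
        return ct.startsWith("text/") || ct.contains("json") || ct.contains("xml");
    }

    /** Reads up to n bytes and rewinds, so the stream can still be consumed. */
    static byte[] peek(BufferedInputStream in, int n) throws IOException {
        in.mark(n);
        byte[] head = new byte[n];
        int read = 0;
        while (read < n) {
            int r = in.read(head, read, n - read);
            if (r == -1) {
                break; // stream ended before n bytes were available
            }
            read += r;
        }
        in.reset();
        return Arrays.copyOf(head, read);
    }
}
```

With the connection from the code above, this would be used roughly as: wrap con.getInputStream() in a BufferedInputStream, call peek(in, 5), and only build an InputStreamReader when looksLikePdf returns false and isTextual(con.getContentType()) returns true.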
[here][1]: Using Java to pull data from a webpage?
[here][2]: Trying to read from a URL(in Java) produces gibberish on certain occaisions
[here][3]: URLConnection FileNotFoundException for non-standard HTTP port sources
[here][4]: Setting "User-Agent" parameters for URLConnection for querying Google from a Java application
[here][5]: Setting user agent of a java URLConnection
[here][6]: Trying to read from a URL(in Java) produces gibberish on certain occaisions
[this][1]: http://www.whatsmyuseragent.com/
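Answer 0 (score: 0)
I managed to avoid having to parse the file directly from the web.
I got pdfbox by adding the dependency given here (see also this) to my pom.xml and running mvn clean install.
I then downloaded the file to my PC, following the post here.
Then (now that I have pdfbox) I added these 3 lines: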
PDDocument pdf = PDDocument.load(new File("sample.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
String plainText = stripper.getText(pdf);
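as mentioned at http://pdfbox.apache.org/2.0/getting-started.html.
This is not a perfect solution. It uses storage on my PC to hold the file (I could probably keep just one file and delete it each time, but I haven't checked that yet), and it may use too much memory, because the program has to parse the whole file inside getText(). Still, it solves my problem of how to parse this particular site, where the only thing that matters to my program is extracting the text.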
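The "keep just one file and delete it each time" idea could look roughly like the sketch below, which uses only the standard library. It is an untested illustration rather than what I actually ran: the class TempDownload, the FileStep interface, and the "download-" file prefix are all hypothetical names. The download is copied to a temporary file, a caller-supplied step processes it, and the file is deleted afterwards even if that step fails.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: process a download through a short-lived temp file.
final class TempDownload {

    /** A step that turns a file on disk into a result, e.g. text extraction. */
    interface FileStep<T> {
        T apply(Path file) throws IOException;
    }

    /** Copies the stream to a temp file, runs the step, then deletes the file. */
    static <T> T withTempCopy(InputStream source, FileStep<T> step)
            throws IOException {
        Path tmp = Files.createTempFile("download-", ".pdf");
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
            return step.apply(tmp);
        } finally {
            Files.deleteIfExists(tmp); // runs on success and on failure
        }
    }
}
```

The three pdfbox lines above would go inside the step, loading from file.toFile() instead of new File("sample.pdf"). This only addresses the leftover file on disk; getText() still has to process the whole document in memory.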
[here][1]: http://blog.e-zest.net/extracting-text-from-a-pdf-file/
[here][2]: How to download a PDF from a given URL in Java?
[this][1]: look