Question

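I want to parse an HTML document from a URL using Java.

When I enter the URL in my browser (Chrome), it doesn't display the HTML page; it downloads it instead.

So the URL is the link behind a "Download" button on a web page. No problem so far. The URL is "https://www.shazam.com/myshazam/download-history", and if I paste it into my browser, the download works. But when I try to download it with Java, I get a 401 (Unauthorized) error.

I checked Chrome's network tools while loading the URL and found that my profile-data and registration cookies are sent with the HTTP GET request.

I've tried many different approaches, but none of them worked. So my question is: how do I reproduce this request in Java? How do I fetch (download) the HTML file and parse it?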

Update

Here is what we've found so far (thanks to Andrew Regan):

BasicCookieStore store = new BasicCookieStore();
store.addCookie( new BasicClientCookie("profile-data", "value") );  // profile-data
store.addCookie( new BasicClientCookie("registration", "value") );  // registration
Executor executor = Executor.newInstance();
String output = executor.use(store)
            .execute(Request.Get("https://www.shazam.com/myshazam/download-history"))
            .returnContent().asString();

The last line of code seems to cause a NullPointerException. The rest of the code seems to work fine for loading unprotected web pages.

Answer 1

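So if you delete those cookies or use a private session, the browser should reproduce what you're seeing in your code.

I'd guess you need to go to "http://www.shazam.com/myshazam" first and log in.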
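
To illustrate why logging in first matters, here is a minimal sketch using the JDK's own `java.net.CookieManager`. It is not something the original posts use, and the cookie value `abc` is made up. The class and method names (`SessionCookieSketch`, `cookiesFor`) are hypothetical. The sketch only simulates a login response by handing a `Set-Cookie` value to the manager, then asks which `Cookie` header values would be attached to a later request.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class SessionCookieSketch {

    /**
     * Stores the given Set-Cookie header values as if they had arrived in a
     * response for loginUri, then returns the Cookie header values the manager
     * would attach to a request for targetUri.
     */
    public static List<String> cookiesFor(CookieManager manager, URI loginUri,
                                          List<String> setCookieValues, URI targetUri) {
        try {
            // Simulated login response (assumption: no real request is made here)
            manager.put(loginUri, Collections.singletonMap("Set-Cookie", setCookieValues));
            Map<String, List<String>> requestHeaders =
                    manager.get(targetUri, Collections.<String, List<String>>emptyMap());
            List<String> cookies = requestHeaders.get("Cookie");
            return cookies == null ? Collections.<String>emptyList() : cookies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real program, calling `CookieHandler.setDefault(manager)` before the login request should let `HttpURLConnection` store and resend such cookies on its own; whether the site's login flow can be driven that way is untested here.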

Answer 2

You can try adding the cookie values to the GET request, for example using the HttpClient Fluent API:

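import org.apache.http.client.CookieStore;
import org.apache.http.client.fluent.Executor;
import org.apache.http.client.fluent.Request;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.cookie.BasicClientCookie;

CookieStore store = new BasicCookieStore();
// Replace the "value" placeholders with the cookie values copied from the browser
store.addCookie( new BasicClientCookie("profile-data", "value") );
store.addCookie( new BasicClientCookie("registration", "value") );

Executor executor = Executor.newInstance();
String output = executor.cookieStore(store)
        .execute(Request.Get("https://www.shazam.com/myshazam/download-history"))
        .returnContent().asString();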

To parse it, you could do something like this:

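import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Jsoup.parse returns a Document; select the table cells from it
Document dom = Jsoup.parse(output);
for (Element element : dom.select("tr td")) {
    String eachCellValue = element.text();
    // Whatever
}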

(You haven't provided any more details.)

Answer 3

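I found the answer myself. It uses HttpURLConnection, and the same approach can be used to "authenticate" against all sorts of services. I used Chrome's built-in network tools to get the cookie values from the GET request.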

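import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

HttpURLConnection con = (HttpURLConnection) new URL("https://www.shazam.com/myshazam/download-history").openConnection();
con.setRequestMethod("GET");
// Paste the cookie values copied from the browser's network tools
con.addRequestProperty("Cookie", "registration=Cookie_Value_Here; profile-data=Cookie_Value_Here");
// try-with-resources closes the reader even if reading fails
try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
}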
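
The same idea can be wrapped into a small helper. The following is only a sketch: the class and method names (`CookieDownloadSketch`, `buildCookieHeader`, `download`) are hypothetical, UTF-8 is assumed for the response body, and the status check is an addition that turns a response such as the 401 from the question into an explicit exception instead of a failure inside `getInputStream()`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

final class CookieDownloadSketch {

    /** Joins name/value pairs into one Cookie header value, e.g. "a=1; b=2". */
    public static String buildCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    /** Sends a GET with the given cookies and returns the response body as text. */
    public static String download(String url, Map<String, String> cookies) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("GET");
        con.setRequestProperty("Cookie", buildCookieHeader(cookies));

        int status = con.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            // For example 401 when the cookies are missing or have expired
            throw new IOException("Unexpected HTTP status " + status + " for " + url);
        }
        // Assumption: the body is UTF-8 encoded text
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            return in.lines().collect(Collectors.joining("\n"));
        }
    }
}
```

A call might look like `CookieDownloadSketch.download("https://www.shazam.com/myshazam/download-history", cookies)`, where `cookies` is a `LinkedHashMap` holding the `registration` and `profile-data` values copied from the browser. The returned string can then be handed to an HTML parser.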

Download a file with HTTP GET in Java, passing cookies
