Question

I am trying to connect to a website with JSoup, but it is not working.

Here is my code:

        Connection.Response res = Jsoup.connect("http://www.metalbulletin.com/Login.html?ReturnURL=%2fdefault.aspx&")
        .data("username", "94mkr@mail4gmail.com", "password", "jakdjique&THFI#")
        .method(Method.POST)
        .execute();

        Map<String, String> loginCookies = res.cookies();

        Document doc = Jsoup.connect("https://www.metalbulletin.com/Article/3838710/Home/CHINA-REBAR-Domestic-prices-recover-after-trading-pick-up.html")
        .cookies(loginCookies)
        .get();

        Element article             = doc.getElementById("article-body");   
        Elements heading            = article.getElementsByTag("h1");
        Elements lead               = article.getElementsByClass("lead");
        Elements lead1              = article.getElementsByClass("articleContainer");

        System.out.println(lead);   
        System.out.println(lead1);

I have put in a temporary login and password so that you can check it. I noticed that http://www.metalbulletin.com/Login.html?ReturnURL=%2fdefault.aspx& generates a new link, for example:
https://account.metalbulletin.com/identity/login?signin=fab48076d8a4f74f52565dd6a9f47e65

I have tried a lot of things, but I still cannot get into the site.

Update
I refined the code as follows:

    Connection.Response response = Jsoup.connect("http://www.metalbulletin.com/Login.html?ReturnURL=%2fdefault.aspx&")
    .method(Connection.Method.GET)
    .execute();

    response = Jsoup.connect("http://www.metalbulletin.com/Login.html?ReturnURL=%2fdefault.aspx&")
    .data("username", "94mkr@mail4gmail.com", "password", "jakdjique&THFI#")
    .cookies(response.cookies())
    .method(Connection.Method.POST)
    .execute();

    Map<String, String> cookies = new HashMap<String, String>();

    Document doc = Jsoup.connect("https://www.metalbulletin.com/Article/3838710/Home/CHINA-REBAR-Domestic-prices-recover-after-trading-pick-up.html")
    .cookies(response.cookies())
    .get();

    System.out.println(response.statusMessage()+"\n"+response.statusCode());

When I run it, the output is:
OK 200
But when I move on to the next part, extracting the data:

    Element article             = doc.getElementById("article-body");   
    Elements lead               = article.getElementsByClass("lead");
    Elements lead1              = article.getElementsByClass("articleContainer");

    System.out.println(lead);   
    System.out.println(lead1);

it gives up and returns the data that is shown to users who are not logged in.

Answer 1

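Assuming you want to browse the site with the given credentials, I suggest logging in from a normal browser, copying the cookies the site generates, and adding them to a CookieStore instance.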

    BasicCookieStore cookieStore = new BasicCookieStore();

    BasicClientCookie cookie1 = new BasicClientCookie("__gads", "ID=958b183c83ede6e8:T=1539776783:S=ALNI_MbFRRpTafZvTiJAjKmTB9oBQelWWw");
    cookie1.setDomain(".metalbulletin.com");
    cookie1.setPath("/");

    BasicClientCookie cookie2 = new BasicClientCookie("__utma", "167598498.350699797.1539776871.1539776871.1539776871.1");
    cookie2.setDomain(".metalbulletin.com");
    cookie2.setPath("/");
    ....
    cookieStore.addCookie(cookie1);
    cookieStore.addCookie(cookie2);
    ....

Then use the cookie store when creating the connection pool.

    PoolingHttpClientConnectionManager connManager = new PoolingHttpClientConnectionManager();
    connManager.setMaxTotal(256);
    connManager.setDefaultMaxPerRoute(64);
    ConnectionKeepAliveStrategy myStrategy = new DefaultConnectionKeepAliveStrategy();
    CloseableHttpClient closeableHttpClient = HttpClientBuilder.create()
            .setDefaultCookieStore(cookieStore) // assumes the BasicCookieStore populated above
            .setDefaultRequestConfig(RequestConfig.custom()
                    .setCookieSpec(CookieSpecs.STANDARD).build())
            .setConnectionManager(connManager).setKeepAliveStrategy(myStrategy).build();

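Either way, if you want to log in to the site you need some way to handle cookies and tokens. With this setup the cookie store takes care of the cookies: you call the site with the HTTP client and then parse the returned HTML with jsoup.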
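
If you would rather not pull in Apache HttpClient, the same idea can be sketched with the JDK's own java.net.CookieManager and java.net.http.HttpClient (Java 11+). This is a swapped-in alternative, not the setup shown above. The names BrowserCookieClient, cookieManagerFrom and clientWith, and the cookie name exampleSession, are made up for illustration; the real cookie names and values have to come from your own browser session.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;
import java.net.http.HttpClient;
import java.util.LinkedHashMap;
import java.util.Map;

final class BrowserCookieClient {

    /** Copies name/value pairs taken from the browser into an in-memory cookie store. */
    public static CookieManager cookieManagerFrom(Map<String, String> browserCookies,
                                                  URI site, String domain) {
        // null store = default in-memory store; ACCEPT_ALL also keeps cookies the server sets later
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        for (Map.Entry<String, String> entry : browserCookies.entrySet()) {
            HttpCookie cookie = new HttpCookie(entry.getKey(), entry.getValue());
            cookie.setDomain(domain);   // e.g. ".metalbulletin.com"
            cookie.setPath("/");
            cookie.setVersion(0);       // Netscape-style cookie, sent as plain name=value
            manager.getCookieStore().add(site, cookie);
        }
        return manager;
    }

    /** Builds an HTTP client that sends and updates the cookies held by the manager. */
    public static HttpClient clientWith(CookieManager manager) {
        return HttpClient.newBuilder()
                .cookieHandler(manager)
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>();
        // Hypothetical cookie: replace with the ones copied from your logged-in browser.
        cookies.put("exampleSession", "value-copied-from-browser");

        CookieManager manager = cookieManagerFrom(
                cookies, URI.create("https://www.metalbulletin.com/"), ".metalbulletin.com");
        HttpClient client = clientWith(manager);

        System.out.println(manager.getCookieStore().getCookies().size() + " cookie(s) loaded");
        System.out.println("cookie handler attached: " + client.cookieHandler().isPresent());
    }
}
```

The response body returned by client.send(...) can then be handed to Jsoup.parse(html) in the same way as with the Apache client.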

Edit: these are the steps you need to follow:

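1. Log in using a browser.
2. Create a BasicCookieStore containing all the cookies the browser saved. You can use the developer console and watch which cookies change as you browse the site to work out which ones matter most, but to be safe, add them all.
3. Create the HTTP client (with its connection manager) and add the cookieStore to it.
4. Now consider yourself logged in and start calling the pages you need to scrape. Just issue GET requests with the pooled client you created, for example for the page "https://www.metalbulletin.com/Article/3838710/Home/CHINA-REBAR-Domestic-prices-recover-after-trading-pick-up.html".
5. If everything was done correctly, the request should return the HTML page source.
6. Use Jsoup.parse(stringHtml) to convert the string response into a Document object.
7. Parse the response and extract the elements you need.
8. Make another request. Get the response as a string. Parse the HTML with jsoup. Repeat.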
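
The request part of these steps could look roughly like the sketch below, written against the JDK's java.net.http.HttpClient rather than the Apache client, so treat it as an illustration only. ArticleFetcher, get, landedOnLogin and fetchHtml are hypothetical names. The login check is a heuristic based on the account.metalbulletin.com/identity/login address mentioned in the question, and it will not catch a page that silently serves the logged-out version, so you should still check for an element such as article-body after parsing.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class ArticleFetcher {

    /** Builds a plain GET request for one page. */
    public static HttpRequest get(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    /**
     * Heuristic (an assumption, not documented site behaviour): if the final URI
     * after redirects is the identity login page, the cookies were not accepted.
     */
    public static boolean landedOnLogin(URI finalUri) {
        String host = finalUri.getHost();
        String path = finalUri.getPath();
        return host != null && host.startsWith("account.")
                && path != null && path.contains("/login");
    }

    /** Fetches one page as a string, failing loudly when the session looks invalid. */
    public static String fetchHtml(HttpClient client, String url)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(get(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200 || landedOnLogin(response.uri())) {
            throw new IOException("Request failed or not logged in: "
                    + response.statusCode() + " " + response.uri());
        }
        return response.body(); // pass this string to Jsoup.parse(...)
    }

    public static void main(String[] args) {
        // No network access here: only shows the heuristic on the URL from the question.
        URI login = URI.create(
                "https://account.metalbulletin.com/identity/login?signin=fab48076d8a4f74f52565dd6a9f47e65");
        System.out.println("looks like login page: " + landedOnLogin(login));
    }
}
```

In a scraping loop you would call fetchHtml(client, url) for each article URL, using a client that already carries the browser cookies, and stop as soon as it throws.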

Good luck.

JSoup does not connect to a website using login and password
