Question

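I need to use Nutch to crawl posts on the website https://hl.com, but the site requires a login for some pages, such as profiles and certain posts. So I need to authenticate first. I tried the code below, but it did not work: all I get back is blank HTML.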

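import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;

String url = "https://hl.com/user/Joanne74";
// Log in first and keep the response so its cookies can be reused
Connection.Response res = Jsoup.connect("https://hl.com/login")
        .data("email", "email", "password", "mypassword")
        .method(Method.POST)
        .timeout(0)
        .execute();

Map<String, String> cookies = res.cookies();

// Request the protected page with the login cookies
Connection connection = Jsoup.connect(url);
org.jsoup.nodes.Document doc = connection.cookies(cookies).timeout(0).get();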

Answer 1

This page is tricky. It relies heavily on JavaScript to load its dynamic content via AJAX.

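The login form posts the username and password to https://healthunlocked.com/api/session (not to https://healthunlocked.com/login). You can confirm this with your browser's debugger.
Use .ignoreContentType(true) to avoid Exception in thread "main" org.jsoup.UnsupportedMimeTypeException: Unhandled content type, because the server sends JSON in reply.
Parsing https://healthunlocked.com/user/Joanne74 is of no use, because that page only loads some JavaScript. You can, however, use the debugger to watch the other content requests: https://healthunlocked.com/api/posts?userId=909195 or https://healthunlocked.com/api/activity?filter=user-activity-public&pageNumber=1&id=909195 or https://healthunlocked.com/api/profile?username=Joanne74&showPrivateFields=false return all the information you need. Once again, though, the responses are JSON, so you will need a library other than jsoup to parse them further.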
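The flow described above (log in against the session endpoint, keep the cookies, then request the JSON endpoints) might look roughly like the sketch below. It swaps Jsoup for the JDK's built-in java.net.http.HttpClient (Java 11+), which returns the JSON body as a plain string, so no content-type workaround is needed. Several things here are assumptions rather than facts about the site: the form field names "username" and "password", the form-encoded request body, and the class and helper names (SessionLoginSketch, formBody, profileUrl), which are purely illustrative.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class SessionLoginSketch {

    static final String BASE = "https://healthunlocked.com";

    // Encodes key/value pairs as an application/x-www-form-urlencoded body.
    static String formBody(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Builds the profile endpoint URL mentioned in the answer.
    static String profileUrl(String username) {
        return BASE + "/api/profile?username="
                + URLEncoder.encode(username, StandardCharsets.UTF_8)
                + "&showPrivateFields=false";
    }

    public static void main(String[] args) throws Exception {
        // The CookieManager keeps any cookies set by the login response
        // and sends them back on later requests made with the same client.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();

        // Assumed field names; check the real ones in the browser debugger.
        Map<String, String> credentials = new LinkedHashMap<>();
        credentials.put("username", "email");
        credentials.put("password", "mypassword");

        // Post the credentials to the session endpoint, not to /login.
        HttpRequest login = HttpRequest.newBuilder(URI.create(BASE + "/api/session"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(credentials)))
                .build();
        HttpResponse<String> loginResponse =
                client.send(login, HttpResponse.BodyHandlers.ofString());
        System.out.println("Login status: " + loginResponse.statusCode());

        // Fetch the profile data as raw JSON using the same client.
        HttpRequest profile = HttpRequest.newBuilder(URI.create(profileUrl("Joanne74")))
                .GET()
                .build();
        String json = client.send(profile, HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(json);
    }
}
```

The printed string is unparsed JSON; turning it into objects still requires a JSON library of your choice. If you prefer to stay with Jsoup, the same sequence applies: post to the session endpoint with .ignoreContentType(true), then pass res.cookies() to the requests for the API URLs.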
