I am trying to read a forum page with Jsoup, but I cannot do so. I log in successfully and I am able to read the first page (the thread list), but when I go into a thread page it gives me a 403. Here is the code:
Connection.Response loginForm = Jsoup.connect("http://picturepub.net/index.php?login/login")
        .method(Connection.Method.GET)
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0")
        .timeout(0)
        .execute();
Document doc = Jsoup.connect("http://picturepub.net/index.php?login/login")
        .data("cookieexists", "false")
        .data("cookie_check", "1")
        .data("login", "swordblazer")
        .data("password", "picturepub")
        .data("register", "0")
        .data("redirect", "/index.php")
        .cookies(loginForm.cookies())
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0")
        .post();
doc = loginForm.parse();
Map<String, String> cookies = loginForm.cookies();
List<String> urls = new ArrayList<String>();
List<String> threadUrls = new ArrayList<String>();
int h = 0;
for (int i = 1; i < 20; i++) {
    if (i == 1)
        doc = Jsoup.connect("http://picturepub.net/index.php?forums/photoshoots-magazines.51/")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0")
                .cookies(cookies)
                .get();
    else
        doc = Jsoup.connect("http://picturepub.net/index.php?forums/photoshoots-magazines.51/page-" + i)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0")
                .cookies(cookies)
                .get();
    // get all links
    Elements links = doc.select("a[href]");
    System.out.println(doc.title());
    for (Element element : links) {
        if (element.absUrl("href").contains("threads")) {
            String linkImage = element.absUrl("href");
            Document document = Jsoup.connect(linkImage).cookies(cookies).get();
            if (!threadUrls.contains(linkImage)) {
                threadUrls.add(linkImage);
                h++;
            }
        }
    }
}
Answer 0 (score: 0):
Jsoup connections are unrelated to each other, so they do not share login state (a session). You have to carefully copy that state between them yourself. You are getting HTTP 403 for the following reasons:

- The loginForm response does not return the authentication cookies, and they cannot be used for authorized resources, yet you use those cookies later. Obtain the cookies from the POST http://picturepub.net/index.php?login/login response instead, and do not convert that response to a document.
- The second request must be declared as a POST request using method(POST).
- Jsoup.connect(linkImage).cookies(cookies).get(); is missing the User-Agent.

To make the code less error-prone, you should consider refactoring it to make it more robust.
import java.io.IOException;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Connection.Response;
import org.jsoup.nodes.Document;

import static java.lang.System.out;
import static java.util.Collections.emptyMap;
import static java.util.stream.Collectors.toSet;
import static org.jsoup.Connection.Method.GET;
import static org.jsoup.Connection.Method.POST;
import static org.jsoup.Jsoup.connect;

private static final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0";
private static final String BASE_URL = "http://picturepub.net/index.php";
private static final int PAGE_COUNT = 20;

static void grab()
        throws IOException {
    out.println("Getting the login form...");
    final Response getLoginFormResponse = prepareConnection(GET, "?login/login", emptyMap())
            .execute();
    out.println("Posting the login data...");
    // Avoid converting to a document when it's unnecessary and use `execute()`
    final Response postLoginFormResponse = prepareConnection(POST, "?login/login", getLoginFormResponse.cookies())
            .data("cookieexists", "false")
            .data("cookie_check", "1")
            .data("login", ...YOUR USERNAME...)
            .data("password", ...YOUR PASSWORD...)
            .data("register", "0")
            .data("redirect", "/index.php")
            .execute();
    // Obtain the authentication cookies
    final Map<String, String> cookies = postLoginFormResponse.cookies();
    // If you want to discard duplicates, just don't use lists -- sets are designed for unique elements.
    // The `h` counter is unnecessary because you can query the collection for its size: threadUrls.size()
    final Collection<String> threadUrls = new LinkedHashSet<>();
    for ( int i = 1; i <= PAGE_COUNT; i++ ) {
        out.printf("Page #%d...\n", i);
        final Document getPageDocument = prepareConnection(GET, "?forums/photoshoots-magazines.51/" + (i == 1 ? "" : "page-" + i), cookies)
                .execute()
                .parse();
        out.printf("Page #%d: %s\n", i, getPageDocument.title());
        // `a[href*=threads/]` is a selector matching all <A> elements whose URL contains "threads/" -- no need to check for the substring later
        // The following code uses Java 8 streams to filter out duplicate links on the page
        final Iterable<String> hrefs = getPageDocument.select("a[href*=threads/]")
                .stream()
                .map(e -> e.absUrl("href"))
                .collect(toSet());
        for ( final String href : hrefs ) {
            out.printf("Probing: %s ... ", href);
            final Response analyzeMeResponse = prepareConnection(GET, stripBaseUrl(href), cookies)
                    .execute();
            threadUrls.add(href);
            out.println("Done!");
        }
    }
    out.println(threadUrls);
}

private static String stripBaseUrl(final String url)
        throws IllegalArgumentException {
    if ( !url.startsWith(BASE_URL) ) {
        // This must not happen for a well-written parser
        throw new IllegalArgumentException(url);
    }
    return url.substring(BASE_URL.length());
}

// Just make sure that a particular connection:
// * is bound to the BASE_URL defined above
// * is bound to a specific HTTP method
// * follows redirects
// * has the User-Agent set
// * has the cookies set
private static Connection prepareConnection(final Method method, final String url, final Map<String, String> cookies) {
    return connect(BASE_URL + url)
            .followRedirects(true)
            .method(method)
            .userAgent(USER_AGENT)
            .cookies(cookies);
}
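As an aside on the deduplication comments above: collecting into a LinkedHashSet drops duplicate URLs while preserving first-seen order, so no manual counter like h is needed. A minimal standalone sketch (the URLs here are made up for illustration):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.stream.Collectors;

public class DedupDemo {

    public static void main(String[] args) {
        // Simulated absolute hrefs scraped from a page, including one duplicate
        List<String> hrefs = Arrays.asList(
                "http://picturepub.net/index.php?threads/first.1/",
                "http://picturepub.net/index.php?threads/second.2/",
                "http://picturepub.net/index.php?threads/first.1/");

        // toCollection(LinkedHashSet::new) keeps only unique URLs, in insertion order
        Collection<String> unique = hrefs.stream()
                .filter(href -> href.contains("threads"))
                .collect(Collectors.toCollection(LinkedHashSet::new));

        System.out.println(unique.size()); // 2
        System.out.println(unique);
    }
}
```

The same effect is what the LinkedHashSet in grab() achieves across pages; unique.size() replaces the original h counter.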
The code above is based on org.jsoup:jsoup:1.10.1, because earlier Jsoup versions cannot handle the HTTP 307 Temporary Redirect this site uses.