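I'm building a Groovy & Grails application with MongoDB as the backend. I use crawler4j for crawling and JSoup for parsing. I need to get the HTTP status of each URL and save it to the database. This is what I'm trying: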
@Override
void visit(Page page) {
    try {
        Document doc = Jsoup.connect(url).get();
        Connection.Response response = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .execute();
        int statusCode = response.statusCode();
        println "statuscode is " + statusCode
        if (statusCode == 200)
            urlExists = true // urlExists is a boolean variable
        else
            urlExists = false
        // save to database
        resource = new Resource(mimeType: "text/html", URLExists: urlExists)
        if (!resource.save(flush: true, failOnError: true)) {
            resource.errors.each { println it }
        }
        // other code
    } catch (Exception e) {
        log.error "Exception is ${e.message}"
    }
}
@Override
protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
    if (statusCode != HttpStatus.SC_OK) {
        if (statusCode == HttpStatus.SC_NOT_FOUND) {
            println "Broken link: " + webUrl.getURL() + ", this link was found in page: " + webUrl.getParentUrl()
        }
        else {
            println "Non success status for link: " + webUrl.getURL() + ", status code: " + statusCode + ", description: " + statusDescription
        }
    }
}
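The problem is that when a URL returns an HTTP status other than 200 (OK), crawler4j goes straight to handlePageStatusCode() (this is built-in crawler4j behavior) and prints the non-success message, but nothing is saved to the database. Is there any way to save to the database when the page status is not 200? If I'm doing something wrong, please tell me. Thanks.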
Answer 0 (score: 0)
Why not save it to the database when execution reaches handlePageStatusCode?
protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
    if (statusCode != HttpStatus.SC_OK) {
        if (statusCode == HttpStatus.SC_NOT_FOUND) {
            println "Broken link: " + webUrl.getURL() + ", this link was found in page: " + webUrl.getParentUrl()
            // save to database
        }
        else {
            println "Non success status for link: " + webUrl.getURL() + ", status code: " + statusCode + ", description: " + statusDescription
        }
    }
}
The crawler then moves on to the next link, where you can do the same thing.
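A minimal sketch of that idea, written as a plain Groovy script so it runs without Grails or crawler4j. The names `recordStatus`, `persist`, and `saved` are hypothetical, and the `url` and `statusCode` properties are assumptions: the question only shows `mimeType` and `URLExists` on `Resource`. In the real crawler, `persist` would presumably wrap the `new Resource(...).save(...)` call from the question.

```groovy
// In-memory stand-in for the database so this runs as a plain Groovy script.
def saved = []

// Hypothetical persistence hook. In the Grails app this could instead be:
//   new Resource(props).save(flush: true, failOnError: true)
def persist = { Map props -> saved << props }

// Hypothetical helper, meant to be called from handlePageStatusCode().
// mimeType and URLExists come from the question; url and statusCode are assumed
// extra fields that the Resource domain class would need to declare.
def recordStatus = { String url, int statusCode ->
    persist([
        url       : url,
        statusCode: statusCode,
        mimeType  : "text/html",
        URLExists : statusCode == 200
    ])
}

// Inside the crawler, the call might look like this:
//   handlePageStatusCode(webUrl, statusCode, ...) -> recordStatus(webUrl.getURL(), statusCode)
recordStatus("http://example.com/ok", 200)
recordStatus("http://example.com/missing", 404)
```

Routing every status through one helper keeps the 404 branch and the generic non-success branch from each needing their own save logic.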
Or you can save it here, in the else branch:
if (statusCode == 200)
    urlExists = true // urlExists is a boolean variable
else {
    // save to database
    urlExists = false
}
EDIT:
Add webUrl.getURL() to an ArrayList, and save the list to the database at the end.
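The collect-then-save approach could be sketched roughly as follows. `FailedUrlCollector` and its methods are hypothetical names, and the save step is passed in as a closure so the sketch runs as a plain Groovy script. The list is synchronized on the assumption that one collector is shared between several crawler threads. In crawler4j, the `onBeforeExit()` hook of `WebCrawler` is one candidate place to call `flush`.

```groovy
// Hypothetical collector: gathers URLs during the crawl and saves them in one pass.
class FailedUrlCollector {
    // Synchronized wrapper, assuming several crawler threads share this collector.
    private final List<String> failedUrls =
            Collections.synchronizedList(new ArrayList<String>())

    // Called from handlePageStatusCode(), e.g. collector.add(webUrl.getURL()).
    void add(String url) {
        failedUrls.add(url)
    }

    // Passes each collected URL to the save closure, clears the list,
    // and returns how many URLs were handed over.
    int flush(Closure save) {
        List<String> snapshot
        synchronized (failedUrls) {
            snapshot = new ArrayList<String>(failedUrls)
            failedUrls.clear()
        }
        snapshot.each { save(it) }
        return snapshot.size()
    }
}
```

In the Grails application the closure passed to `flush` would do the actual database write, for example by creating and saving one `Resource` per URL.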