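I am using HtmlUnit (Java) to scrape data from a publication website (ResearchGate). To scrape the data, I read the URLs from a text file. The file contains nearly 4,000 URLs (all of the pages follow a similar pattern but hold different data). When I run my logic against all 4,000 URLs, I get this error: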
com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 429 Too Many Requests for https://www.researchgate.net/application.RequestQuotaExceeded.html?tk=i1iSnVitFTozE0uu1nlOqH6CgwJA0vikMY_2VFnCBM3JDz4SZrupIy5I4yAT5KBOFAX-LySwTEIR4dak8u0FRHod9caWkRiNZS6RDGKXCY2Gn7kh80q72oaXjk8RWsXqqfcrNa3ULlnSHgQ
at com.gargoylesoftware.htmlunit.WebClient.throwFailingHttpStatusCodeExceptionIfNecessary(WebClient.java:537)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:362)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:434)
at com.pollak.library.Authenticator.autoLogin(Authenticator.java:70)
at com.pollak.library.FetchfromPublicationPage.main(FetchfromPublicationPage.java:34)
My code is:
package com.pollak.library;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class FetchfromPublicationPage {

    public static void main(String[] a) throws Exception {
        String path = "Path to the text file which contains 4000 URLs";
        File file = new File(path);
        String baseUrl = "https://www.researchgate.net";
        String login = "<login_ID>";
        String password = "<password>";
        File facurl = new File("Path to the file in which I want to save scraped information");
        int neha = 1; // running row number written to the output file

        // try-with-resources closes the reader and flushes and closes the writer
        try (BufferedReader br = new BufferedReader(new FileReader(file));
             BufferedWriter bw = new BufferedWriter(new FileWriter(facurl))) {
            WebClient client = Authenticator.autoLogin(baseUrl + "/login", login, password);
            String facultyprofileurl;
            while ((facultyprofileurl = br.readLine()) != null) {
                String info = "", ath = "";
                // Each input line is comma-separated; the page URL is the third field
                String[] arr = facultyprofileurl.split(",");
                HtmlPage page = client.getPage(arr[2]);
                if (page.asText().contains("You need to sign in for access to this page")) {
                    throw new Exception(String.format("Error during login on %s , check your credentials", baseUrl));
                }

                // Blocks holding the article information
                List<HtmlElement> items = (List<HtmlElement>) page.getByXPath(
                        "//div[@class='nova-e-text nova-e-text--size-m nova-e-text--family-sans-serif nova-e-text--spacing-xxs nova-e-text--color-grey-700']");
                // Blocks holding the author names
                List<HtmlElement> items2 = (List<HtmlElement>) page.getByXPath(
                        "//div[@class='nova-e-text nova-e-text--size-l nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-person-list-item__title nova-v-person-list-item__title--clamp-1']");

                if (items.isEmpty()) {
                    System.out.println("No items found !");
                } else {
                    for (HtmlElement htmlItem : items) {
                        HtmlElement articleinfo = (HtmlElement) htmlItem.getFirstByXPath(".//ul");
                        if (articleinfo != null) { // skip blocks without a <ul>
                            info += articleinfo.getTextContent() + "**";
                        }
                    }
                }

                // Check the author list itself, not the article list
                if (items2.isEmpty()) {
                    System.out.println("No items found !");
                } else {
                    for (HtmlElement htmlItem : items2) {
                        HtmlAnchor authors = (HtmlAnchor) htmlItem.getFirstByXPath(".//a");
                        if (authors != null) { // skip blocks without a link
                            ath += authors.getTextContent() + "**";
                        }
                    }
                }

                bw.write(neha + "," + info + "," + ath);
                bw.newLine();
                neha = neha + 1;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Can anyone please guide me on how to resolve this error?
Answer 0 (score: 1)
I'm afraid there is no simple solution. You have to dig in yourself and figure out what is going on.
Maybe a few hints.
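First, you have to get familiar with HTTP and how it works in general. Read up on the error code you are getting until you understand it: 429 Too Many Requests means the server has decided that this client sent too many requests in a given period. The next step is to use a web proxy (for example Charles) to see what is happening on the wire. The server may send additional information (headers) that hints at the rules it uses on its side to detect this situation.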
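One header worth looking for in the proxy is Retry-After, which the HTTP specification allows a server to send with a 429 response. Whether ResearchGate actually sends it is an assumption you would have to verify on the wire. The sketch below is a hypothetical helper (the class and method names are not from HtmlUnit) that turns such a header value into a wait time, using only the JDK. With HtmlUnit, the value would presumably come from the response attached to the FailingHttpStatusCodeException.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Hypothetical helper: turns a Retry-After header value into a wait time. */
final class RetryAfter {

    private RetryAfter() {}

    /**
     * A Retry-After value is either a number of seconds ("120") or an
     * HTTP-date ("Wed, 21 Oct 2015 07:28:00 GMT"). Returns empty when the
     * header is missing or cannot be parsed in either form.
     */
    static Optional<Duration> parse(String headerValue, Instant now) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            return Optional.empty();
        }
        String value = headerValue.trim();

        // Form 1: delay in seconds.
        try {
            long seconds = Long.parseLong(value);
            return Optional.of(Duration.ofSeconds(Math.max(0, seconds)));
        } catch (NumberFormatException notANumber) {
            // Not a plain number, so try the date form next.
        }

        // Form 2: absolute HTTP-date; the wait is the time remaining until it.
        try {
            Instant retryAt = ZonedDateTime
                    .parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            Duration wait = Duration.between(now, retryAt);
            return Optional.of(wait.isNegative() ? Duration.ZERO : wait);
        } catch (DateTimeParseException notADate) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2015-10-21T07:27:00Z");
        System.out.println(parse("120", now));
        System.out.println(parse("Wed, 21 Oct 2015 07:28:00 GMT", now));
        System.out.println(parse(null, now));
    }
}
```

If the header is absent, the helper returns an empty Optional, and the caller has to fall back on a delay found by experiment.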
Next, start with a simple program and try to find the number of requests that triggers the problem.
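Such a probe could be sketched as follows. All names here (QuotaProbe, countUntilRejected, the fetchStatus callback) are hypothetical. The callback stands in for whatever issues the real request and returns its HTTP status code, so the counting logic can be tried without touching the network. The pause between requests is a parameter, so the experiment can be repeated at different speeds.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical probe: counts how many requests pass before the first 429. */
final class QuotaProbe {

    static final int TOO_MANY_REQUESTS = 429;

    private QuotaProbe() {}

    /**
     * Requests the URLs in order, pausing between them, and returns how many
     * were accepted before the first 429 (or urls.size() if none was rejected).
     *
     * @param fetchStatus stand-in for the real HTTP call; returns the status code
     * @param pauseMillis delay after each accepted request, 0 for none
     */
    static int countUntilRejected(List<String> urls,
                                  ToIntFunction<String> fetchStatus,
                                  long pauseMillis) {
        int accepted = 0;
        for (String url : urls) {
            if (fetchStatus.applyAsInt(url) == TOO_MANY_REQUESTS) {
                return accepted;
            }
            accepted++;
            if (pauseMillis > 0) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and stop probing early.
                    Thread.currentThread().interrupt();
                    return accepted;
                }
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Fake server for demonstration: rejects everything after 3 requests.
        int[] calls = {0};
        ToIntFunction<String> fakeServer = url -> ++calls[0] > 3 ? 429 : 200;
        List<String> urls = Arrays.asList("u1", "u2", "u3", "u4", "u5");
        System.out.println(countUntilRejected(urls, fakeServer, 0));
    }
}
```

Running the probe with several pause values should show whether the limit depends on the total number of requests or on their rate.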
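To sum up, we cannot do the analysis work for you. You have to understand how HTTP works and what the HTTP server is doing, and in the end you may find a way. But keep in mind that the people on the server side seem to be blocking bots like yours (for various good reasons). Maybe you will find a solution, but it may only work for a while.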