Question

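I am using HtmlUnit (Java) to scrape data from a publication website (ResearchGate). I supply the URLs from a text file that contains nearly 4000 of them; all the pages follow a similar pattern but hold different data. When I run my logic over all 4000 URLs, I get this error: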

com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 429 Too Many Requests for https://www.researchgate.net/application.RequestQuotaExceeded.html?tk=i1iSnVitFTozE0uu1nlOqH6CgwJA0vikMY_2VFnCBM3JDz4SZrupIy5I4yAT5KBOFAX-LySwTEIR4dak8u0FRHod9caWkRiNZS6RDGKXCY2Gn7kh80q72oaXjk8RWsXqqfcrNa3ULlnSHgQ
    at com.gargoylesoftware.htmlunit.WebClient.throwFailingHttpStatusCodeExceptionIfNecessary(WebClient.java:537)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:362)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:434)
    at com.pollak.library.Authenticator.autoLogin(Authenticator.java:70)
    at com.pollak.library.FetchfromPublicationPage.main(FetchfromPublicationPage.java:34)

My code is:

package com.pollak.library;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class FetchfromPublicationPage {

    public static void main(String a[]) throws Exception {
        String path = "Path to the text file which contains 4000 URLs";
        File file = new File(path);
        BufferedReader br = new BufferedReader(new java.io.FileReader(file));
        String line = null;

        String baseUrl = "https://www.researchgate.net";
        String login = <login_ID>;
        String password = <password>;

        File facurl = new File("Path to the file in which I want to save scraped information");
        FileWriter fw = new FileWriter(facurl);
        BufferedWriter bw = new BufferedWriter(fw);
        int neha = 1;


        try {
            WebClient client = Authenticator.autoLogin(baseUrl + "/login", login, password);
            String facultyprofileurl;
            while ((facultyprofileurl = br.readLine()) != null) {

                String info= "", ath = "";
                String arr[] = facultyprofileurl.split(",");

                HtmlPage page = client.getPage(arr[2]);

                if (page.asText().contains("You need to sign in for access to this page")) {
                    throw new Exception(String.format("Error during login on %s , check your credentials", baseUrl));
                }

                List<HtmlElement> items = (List<HtmlElement>) page.getByXPath(
                        "//div[@class='nova-e-text nova-e-text--size-m nova-e-text--family-sans-serif nova-e-text--spacing-xxs nova-e-text--color-grey-700']");

                List<HtmlElement> items2 = (List<HtmlElement>) page.getByXPath(
                        "//div[@class='nova-e-text nova-e-text--size-l nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-person-list-item__title nova-v-person-list-item__title--clamp-1']");

                String print = "";

                if (items.isEmpty()) {
                    System.out.println("No items found !");
                } else {
                    for (HtmlElement htmlItem : items) {

                        HtmlElement articleinfo = ((HtmlElement) htmlItem.getFirstByXPath(".//ul"));
                        info += articleinfo.getTextContent().toString()+"**";

                    }
                }

                if (items2.isEmpty()) {
                    System.out.println("No items found !");
                } else {
                    for (HtmlElement htmlItem : items2) {

                        HtmlAnchor authors = ((HtmlAnchor) htmlItem.getFirstByXPath(".//a"));
                        ath +=  authors.getTextContent().toString()+"**";


                    }
                }

                bw.write(neha + "," + info +","+ath);
                bw.newLine();
                neha = neha + 1;

            }

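        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Flush and close the files; otherwise buffered rows may never reach the output file.
            bw.close();
            br.close();
        }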

    }
}

Can anyone please advise how to resolve this error?

Answer 1

I'm afraid there is no simple solution. You will have to dig in yourself and work out what is happening.

Here are a few hints.

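First, you need to be familiar with HTTP and how it works in general. Read up on the error code you are getting and try to understand it. The next step is to use a web proxy (Charles, for example) to see what is happening on the wire. The server may send additional information (headers) that hints at the rules it uses on its side to detect this situation.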
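One header worth looking for in the proxy trace is `Retry-After`, which a server may attach to a 429 response to say how long to wait, either as a number of seconds or as an HTTP date. Nothing in the question shows whether ResearchGate actually sends it, so the following is only a sketch: `RetryAfterParser` and its `parse` method are hypothetical names, and the code depends only on the JDK. With HtmlUnit, the raw value could be read from the failed response, for example via `e.getResponse().getResponseHeaderValue("Retry-After")` on the caught `FailingHttpStatusCodeException`.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper (not part of HtmlUnit): turns a Retry-After header value into a wait time.
final class RetryAfterParser {

    // Returns the time to wait, or empty if the header is missing or cannot be parsed.
    static Optional<Duration> parse(String headerValue, Instant now) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            return Optional.empty();
        }
        String value = headerValue.trim();

        // Form 1: delay in seconds, e.g. "120".
        try {
            long seconds = Long.parseLong(value);
            return seconds >= 0 ? Optional.of(Duration.ofSeconds(seconds)) : Optional.empty();
        } catch (NumberFormatException notSeconds) {
            // Not a plain number; try the date form next.
        }

        // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        try {
            Instant retryAt = ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
            Duration wait = Duration.between(now, retryAt);
            // A date in the past means no further waiting is requested.
            return Optional.of(wait.isNegative() ? Duration.ZERO : wait);
        } catch (DateTimeParseException notADate) {
            return Optional.empty();
        }
    }
}
```

If the header is absent, the empty result tells the caller to fall back on its own guess for the pause.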

Next, start with a simple program and try to find the number of requests that triggers the problem.
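Such a probe might look like the sketch below. It is an illustration under assumptions, not a tested scraper: `QuotaProbe`, `requestsBeforeLimit`, and the `fetchStatus` callback are hypothetical names, and the callback is expected to request one URL and return the HTTP status code. With HtmlUnit, that callback could wrap `client.getPage(url).getWebResponse().getStatusCode()` after calling `client.getOptions().setThrowExceptionOnFailingStatusCode(false)`, so that a 429 comes back as a status rather than an exception.

```java
import java.util.List;
import java.util.OptionalInt;
import java.util.function.ToIntFunction;

// Hypothetical probe: counts how many requests are answered before the first 429.
final class QuotaProbe {

    // Returns the number of requests that did not get a 429 before the first one that did,
    // or empty if the limit was never reached with the given URLs.
    static OptionalInt requestsBeforeLimit(List<String> urls,
                                           ToIntFunction<String> fetchStatus,
                                           long pauseMillis) {
        int answered = 0;
        for (String url : urls) {
            if (fetchStatus.applyAsInt(url) == 429) {
                return OptionalInt.of(answered);
            }
            answered++;
            try {
                Thread.sleep(pauseMillis); // fixed pause between requests
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return OptionalInt.empty();
    }
}
```

Running it with several values of `pauseMillis` should show whether the server reacts to the request rate or to the total number of requests.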

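All in all, we can't do the analysis work for you. You have to learn how HTTP works, you have to understand what the HTTP server is doing, and in the end you may find a way. But keep in mind that the people on the server side seem to be blocking bots like yours (for various good reasons). You may find a solution, but it may only work for a while.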
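One common way to react to a 429 is to retry the same URL after a pause that doubles each time. The answer does not prescribe this, and there is no guarantee the server will accept it, so treat the following as a minimal sketch: `BackoffFetcher`, `backoffMillis`, `fetchWithBackoff`, and the `fetchStatus` callback (which requests a URL and returns its status code) are all hypothetical names, and the delays are placeholders to be tuned against what the server actually does.

```java
import java.util.function.ToIntFunction;

// Hypothetical retry helper: waits longer after each 429 before trying the same URL again.
final class BackoffFetcher {

    // Delay before retry number `attempt` (0-based): base, 2*base, 4*base, ... capped at maxMillis.
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    // Calls fetchStatus up to maxAttempts times (assumed >= 1) and returns the last status seen.
    static int fetchWithBackoff(String url, ToIntFunction<String> fetchStatus,
                                int maxAttempts, long baseMillis, long maxMillis) {
        int status = -1;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            status = fetchStatus.applyAsInt(url);
            if (status != 429 || attempt == maxAttempts - 1) {
                return status; // success, a different error, or no attempts left
            }
            try {
                Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return status;
            }
        }
        return status;
    }
}
```

A caller that still receives 429 after the last attempt should probably stop the whole run rather than move on to the next URL.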
