Question

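I have an assignment that requires me to iterate over a website's search results (links to journals), collect each journal's link and other metadata (authors, date, etc.), and write it all to a .txt file, using at least Java and Apache HttpClient but without a web crawler. This is the site: https://www.cochranelibrary.com/cdsr/reviews/topics. I will pick any one of the categories shown and collect the data above for every journal link in that category.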
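A minimal sketch of the output step, assuming each result has already been reduced to four strings. The `Review` class, its field names, the `|` separator, and the `reviews.txt` file name are all illustrative choices, not part of the assignment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ReviewWriter {
    /** Hypothetical holder for one search result; the fields are illustrative. */
    public static final class Review {
        final String url;
        final String title;
        final String authors;
        final String date;

        public Review(String url, String title, String authors, String date) {
            this.url = url;
            this.title = title;
            this.authors = authors;
            this.date = date;
        }
    }

    /** Formats one review as a single line with '|' between the fields. */
    public static String toLine(Review r) {
        return String.join("|", r.url, r.title, r.authors, r.date);
    }

    /** Writes one line per review to the given file as UTF-8, replacing any existing content. */
    public static void writeAll(List<Review> reviews, Path out) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Review r : reviews) {
            lines.add(toLine(r));
        }
        Files.write(out, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<Review> reviews = new ArrayList<>();
        // Placeholder values only; real values would come from the parsed pages.
        reviews.add(new Review("https://example.org/review/1", "Example title",
                "A. Author; B. Author", "2020-01-01"));
        writeAll(reviews, Paths.get("reviews.txt"));
    }
}
```

A field that itself contains `|` would break this format, so a different separator or some escaping may be needed for real titles.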

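Below is some code I found online that uses Apache HttpClient to fetch the HTTP entity and response body. The problem I'm facing is this: with a simple URL such as google.com it runs fine and prints the page source for me to parse later. However, the page I need is full of JavaScript wizardry and isn't cooperating. Every search results page ends in a "/search" endpoint. Using the browser's web developer tools, I managed to find a direct link to the results page I'm currently viewing: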

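However, when I try to use it, it returns a 419 error, which is apparently some kind of unauthorized-access/token problem. So this is my main problem: I can't even "import" the search results for a category, let alone iterate over them to collect the data I need.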

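import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class App
{
  // USER_AGENT was not defined in the snippet I found; this value is a placeholder
  private static final String USER_AGENT = "Mozilla/5.0";

  public static void main(String[] args) throws IOException
  {
    String url = "http://www.google.com";
    CloseableHttpClient httpclient = HttpClients.createDefault();

    try
    {
      HttpGet httpget = new HttpGet(url);
      httpget.addHeader("User-Agent", USER_AGENT);

      System.out.println("Executing request " + httpget.getRequestLine());

      // Returns the body as a String for 2xx responses, throws otherwise
      ResponseHandler<String> responseHandler = new ResponseHandler<String>()
      {
        @Override
        public String handleResponse(final HttpResponse response) throws IOException
        {
          int status = response.getStatusLine().getStatusCode();
          if (status >= 200 && status < 300)
          {
            HttpEntity entity = response.getEntity();
            return entity != null ? EntityUtils.toString(entity) : null;
          }
          else
          { throw new ClientProtocolException("Unexpected response status: " + status); }
        }
      };

      String responseBody = httpclient.execute(httpget, responseHandler);
      System.out.println("----------------------------------------");
      System.out.println(responseBody);
    }
    finally
    { httpclient.close(); }
  }
}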
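
For the "parse later" step, here is a rough sketch of pulling links out of a response body with only the standard library. The `LinkExtractor` class is hypothetical, and the `/cdsr/doi/` filter is an assumption about what the result links look like, not something confirmed from the site. A regular expression is not a real HTML parser, so this only handles simple double-quoted `href` attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Matches href="..." inside an <a> tag; deliberately simple and easy to fool.
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*?href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the distinct hrefs in html that contain mustContain, in order of appearance. */
    public static List<String> extractLinks(String html, String mustContain) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1);
            if (href.contains(mustContain) && !links.contains(href)) {
                links.add(href);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up markup for illustration; the real page structure may differ.
        String html = "<a class=\"result\" href=\"/cdsr/doi/EXAMPLE/full\">Title</a>"
                + "<a href=\"/about\">About</a>";
        System.out.println(extractLinks(html, "/cdsr/doi/"));
    }
}
```

If the results are injected by JavaScript after the page loads, the links will not be in the fetched HTML at all, and this step would have to run against whatever response the "/search" endpoint returns instead.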

Update: After going through the Apache HttpClient documentation, I consolidated my code quite a bit to make it easier to use and test:

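import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

// Request comes from the separate fluent-hc module of HttpClient
import org.apache.http.client.fluent.Request;
import org.apache.http.client.utils.URIBuilder;

public class App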
{
    public static void main(String[] args) throws URISyntaxException
    {
      // Builds the URI
      URI uri = new URIBuilder()
          .setScheme("https")
          .setHost("www.cochranelibrary.com")
          .setPath("/")
          .build();

      // Uses Fluent API to execute GET request with uri
      try
      { System.out.println(Request.Get(uri).execute().returnContent().asString()); }
      catch (IOException e)
      { e.printStackTrace(); }
    }
}

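However, this particular domain keeps giving me trouble. Simply attempting a GET against www.cochranelibrary.com returns a 419 error. I thought the HTTPS protocol might be the problem, but testing against https://www.httpvshttps.com returns results just fine. I don't know why this particular domain is so stubborn.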
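
One thing that might be worth trying, sketched below, is keeping cookies across requests and sending browser-like headers, on the assumption that the 419 is tied to a missing session cookie or token. That cause is a guess and this may not resolve the error. The `SessionFetch` class, the `USER_AGENT` value, and the `Accept` header are illustrative; the second URL is taken from the command line because the direct search link is not reproduced here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.http.client.CookieStore;
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class SessionFetch {
    // Assumed browser-like value; the exact string is not known to matter.
    static final String USER_AGENT =
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
            + "Chrome/120.0 Safari/537.36";

    /** Builds a client that stores cookies in the given store and reuses them on later requests. */
    public static CloseableHttpClient newClient(CookieStore cookies) {
        RequestConfig config = RequestConfig.custom()
                .setCookieSpec(CookieSpecs.STANDARD)
                .build();
        return HttpClients.custom()
                .setDefaultCookieStore(cookies)
                .setDefaultRequestConfig(config)
                .setUserAgent(USER_AGENT)
                .build();
    }

    /** Performs a GET and returns the body, or throws if the status is not 2xx. */
    public static String get(CloseableHttpClient client, String url) throws IOException {
        HttpGet request = new HttpGet(url);
        request.addHeader("Accept", "text/html,application/xhtml+xml");
        try (CloseableHttpResponse response = client.execute(request)) {
            int status = response.getStatusLine().getStatusCode();
            String body = response.getEntity() != null
                    ? EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8)
                    : "";
            if (status < 200 || status >= 300) {
                throw new IOException("Unexpected response status: " + status);
            }
            return body;
        }
    }

    public static void main(String[] args) throws IOException {
        CookieStore cookies = new BasicCookieStore();
        try (CloseableHttpClient client = newClient(cookies)) {
            // First request: lets the server set whatever cookies it wants.
            get(client, "https://www.cochranelibrary.com/cdsr/reviews/topics");
            System.out.println("Cookies stored: " + cookies.getCookies().size());
            // Second request: the search URL, passed in as the first argument.
            if (args.length > 0) {
                System.out.println(get(client, args[0]));
            }
        }
    }
}
```

If the first request itself still fails with 419, printing the response headers and comparing them with what the browser's developer tools show for the same request would be the next thing to check.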

How do I browse a website's search results and collect metadata?

0 answers: