Question

I am parsing HTML links with parallelStream(), as shown in: Jsoup parsing - parsing multiple links simultaneously

 public static void createPageListByObject(String urlsFileName, int Y) throws IOException {
      //List<String> URLs = new ArrayList<>();
      int indx = 1;

      URLs.parallelStream().forEach(URL-> {
        try {
            Page page = Page.Generate(URL, Y);
            FileUtils.writePageToFile(page, indx++);
        }catch (Exception e){
            System.out.println(e.getMessage() + ". Skipping to next url");
        }
    });
  }

  public static Page Generate(String URL, int Y) throws IOException, InstantiationException, IllegalAccessException, NoSuchFieldException, URISyntaxException {
    Connection.Response res = Jsoup.connect(URL).userAgent("Chrome/5.0").timeout(10 * 1000).execute();
    Page tutorialPage = new Page(URL);
    return tutorialPage;
}

 public static void writePageToFile(Page page, int i) throws IOException{
    String directoryName = getDirectory(page.vectorXY().Y);
    ObjectOutputStream os = new ObjectOutputStream(new FileOutputStream(directoryName + "//page" + i));

    os.writeObject(page);
    os.close();
}

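The problem is that with parallelStream() I sometimes get the same index twice, and the file gets overwritten. I need the index of the element that parallelStream() is currently processing. Any suggestions?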

Answer 1

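Java's iterator implementations hide the current index. Iterators are in fact meant for iterating without an index.

If you really need the index, create a list of objects that each hold both the URL and its index. The class below is only a sample; encapsulate it properly.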

class UrlObject {
  private String url;
  private Integer index;
  public UrlObject(String url, Integer index){
    .....
  }
  // getter and setter
}

So, when adding items to the list, use:

List<UrlObject> URLs = new ArrayList<>();
URLs.add(new UrlObject("url here", <index here>));

URLs.parallelStream().forEach(url-> {
  // code here url.getUrl() and url.getIndex()
});
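
The approach above might be filled out roughly as follows. This is a minimal sketch, assuming indices start at 1 as in the question. The names IndexedUrlDemo, withIndices, and process are hypothetical, and process only records which index each URL received, standing in for the Page.Generate and FileUtils.writePageToFile calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class IndexedUrlDemo {
    // Hypothetical holder that pairs each URL with a fixed index.
    static final class UrlObject {
        private final String url;
        private final int index;

        UrlObject(String url, int index) {
            this.url = url;
            this.index = index;
        }

        String getUrl() { return url; }
        int getIndex() { return index; }
    }

    // Indices are assigned sequentially, before any parallel work starts,
    // so no two URLs can end up with the same index.
    static List<UrlObject> withIndices(List<String> urls) {
        List<UrlObject> result = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            result.add(new UrlObject(urls.get(i), i + 1));
        }
        return result;
    }

    // Stand-in for the fetch-and-write step: records index -> URL.
    // ConcurrentHashMap is used because forEach runs on several threads.
    static Map<Integer, String> process(List<UrlObject> items) {
        Map<Integer, String> written = new ConcurrentHashMap<>();
        items.parallelStream().forEach(item -> written.put(item.getIndex(), item.getUrl()));
        return written;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://a.example", "http://b.example", "http://c.example");
        System.out.println(process(withIndices(urls)));
    }
}
```

Because the index travels with the URL, the order in which the parallel stream happens to process elements no longer matters.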

Or you can use any other approach.
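
One such alternative, sketched here as an assumption rather than something the answer spells out, is to stream over the list positions with IntStream.range and look each URL up by index. The names IndexStreamDemo and process are hypothetical, and the map again stands in for the actual page download and file write.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

class IndexStreamDemo {
    // Streams over positions 0..size-1 in parallel. Each position is visited
    // exactly once, so every URL gets a unique, stable index.
    static Map<Integer, String> process(List<String> urls) {
        Map<Integer, String> written = new ConcurrentHashMap<>();
        IntStream.range(0, urls.size())
                 .parallel()
                 .forEach(i -> written.put(i + 1, urls.get(i))); // 1-based, as in the question
        return written;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://a.example", "http://b.example", "http://c.example");
        System.out.println(process(urls));
    }
}
```

This works best with a random-access list such as ArrayList, since urls.get(i) is called for every element. A shared AtomicInteger with getAndIncrement() would also yield unique numbers, but they would follow processing order rather than list order.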

How to get the current index in parallelStream()