Question

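I have the following problem. I am using jsoup to extract images from a page (I'm trying to download a manga), then move on to the next page, download the next image, and so on. Normally I extract the URL of the next page from the button: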

<a href="2.html" class="btn next_page"><span></span>next page</a>

But when a chapter of the manga ends, clicking the button on the page redirects me to the next chapter via JavaScript:

<a href="javascript:void(0);" onclick="next_chapter()" class="btn next_page"><span></span>next page</a>

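Is there a way to extract the link to the next page? I was previously advised to use Selenium; I tried it a few times and failed. Does anyone have any suggestions?

OK, here is my code snippet: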

while (!endManga) {

            Document doc = Jsoup.connect(link).get();
            String title = doc.title();
            System.out.println(title);

            Element nextButtonDiv = doc.getElementById("top_center_bar");
            Elements nextButton = nextButtonDiv.select("a[href]");
            if (nextButton.isEmpty())
                endManga = true;
            else {
                Element nextLinkElement = nextButton
                        .get(nextButton.size() - 1);

                String nextLink;


                //here is the problem - at some point, when one chapter ends,
                //there is no link to the next one, only the onclick="next_chapter()" JavaScript function


                if (nextLinkElement.attr("href").length() < 10)
                    nextLink = nextLinkElement.attr("abs:href");
                else
                    nextLink = nextLinkElement.attr("href");

                link = nextLink;
            }
            Element content = doc.getElementById("viewer");
            Elements jpgs = content.select("img[src$=.jpg]");

            BufferedImage image = null;

            if (jpgs.isEmpty()) {
                System.out.println("empty!!");
                counterVolume++;
            } else {
                for (Element imageURL : jpgs) {
                    image = ImageIO.read(new URL(imageURL.attr("src")));
                    ImageIO.write(image, "jpg", new File("manga/"
                            + counterVolume + "_" + counterPage++ + ".jpg"));
                    System.out.println("zgrane - volume: " + counterVolume
                            + " , page: " + counterPage);
                }
            }
        }

And here is my code using Selenium:

WebDriver driver = new HtmlUnitDriver();
    driver.get("link_to_page_with_javascript_function");
    WebElement element = driver.findElement(By.id("top_center_bar"));
    List<WebElement> el = element.findElements(By.tagName("a"));
    System.out.println(element.getTagName());

    for(WebElement e : el){
        if(e.getText().equals("next page")){
            //here I have the button which, when clicked, redirects me to the next chapter
            //how can I extract the link from this function?
            e.click();
        }
    }

Answer 1

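If the URL structure is consistent, you can construct the correct URL manually once you know you have reached the end of a chapter, as a special case in your extraction algorithm.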

if (endOfChapter) {
  url = "chapter-" + newChapterNum + "/1.html"; // first page of new chapter
}

I know this is not a general solution, but depending on the scope of your application it may be all you need.
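
To make the special case above concrete, here is a minimal Java sketch. It assumes a URL layout of the form .../chapter-<number>/<page>.html, which is only the hypothetical pattern from the snippet above and not something confirmed for the real site. The class name ChapterUrls, both method names, and the example.com address are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChapterUrls {
    // Assumed (hypothetical) URL layout: <prefix>/chapter-<number>/<page>.html
    private static final Pattern CHAPTER_PAGE =
            Pattern.compile("^(.*/chapter-)(\\d+)/\\d+\\.html$");

    /** True when the "next page" href is a javascript: pseudo-link instead of a real URL. */
    static boolean isEndOfChapter(String href) {
        return href.trim().toLowerCase().startsWith("javascript:");
    }

    /** Builds the URL of page 1 of the following chapter from the current page URL. */
    static String firstPageOfNextChapter(String currentUrl) {
        Matcher m = CHAPTER_PAGE.matcher(currentUrl);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected URL layout: " + currentUrl);
        }
        int nextChapter = Integer.parseInt(m.group(2)) + 1;
        return m.group(1) + nextChapter + "/1.html";
    }

    public static void main(String[] args) {
        String current = "http://example.com/manga/chapter-3/24.html";
        String href = "javascript:void(0);"; // what the last page of a chapter exposes
        if (isEndOfChapter(href)) {
            // prints http://example.com/manga/chapter-4/1.html
            System.out.println(firstPageOfNextChapter(current));
        }
    }
}
```

In the question's loop, the check would replace the href length test: when the href of the last link starts with javascript:, compute the next link from the current one instead of reading the attribute.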

Answer 2

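I don't think there is a simple solution that doesn't actually have Selenium do the work. However, I see these possibilities:

If you look at the page source, you can work out what the JavaScript function does and reimplement it in Java. If it loads something from the network, you may need to look at the traffic generated by the click. Without the source of the page you want to scrape, I can't be more specific.
Use Selenium and click() as you already do, then get the loaded URL from Selenium. The method you are looking for is driver.getCurrentUrl(). It may of course be easier to get the page source (driver.getPageSource()) and feed it back into jsoup, then use the usual jsoup methods.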
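
As a rough illustration of the first possibility, the sketch below searches the page's script text for the body of next_chapter() and pulls out the redirect target. It rests on a loud assumption: that the function simply assigns a string literal to location.href (or window.location). The real site may build the URL differently, in which case the pattern has to be adapted after reading the actual script. The class name NextChapterScript, the method findRedirectTarget, and the sample script are hypothetical.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NextChapterScript {
    // Assumed script shape: location.href = "<url>" or window.location = '<url>'
    private static final Pattern REDIRECT =
            Pattern.compile("location(?:\\.href)?\\s*=\\s*([\"'])(.*?)\\1");

    /** Returns the first redirect target found after the start of next_chapter(), if any. */
    static Optional<String> findRedirectTarget(String scriptSource) {
        int start = scriptSource.indexOf("function next_chapter");
        if (start < 0) {
            return Optional.empty();
        }
        Matcher m = REDIRECT.matcher(scriptSource);
        return m.find(start) ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical script text, as it might appear in the page source
        String script = "function next_chapter() {"
                + " window.location.href = \"/manga/chapter-4/1.html\"; }";
        String pageUrl = "http://example.com/manga/chapter-3/24.html";

        findRedirectTarget(script)
                // Resolve a relative target against the current page URL
                .map(target -> URI.create(pageUrl).resolve(target).toString())
                .ifPresent(System.out::println); // prints http://example.com/manga/chapter-4/1.html
    }
}
```

The script text itself can come from the document already fetched with jsoup, for example from the contents of its script elements; if the function lives in an external .js file, that file has to be downloaded first.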

Answer 3

<a href="link-to-the-next-page.html" onclick="next_chapter(event, this)" class="btn next_page"><span></span>next page</a>

Then

var next_chapter = function next_chapter(ev, link){
  // the inline handler passes the click event and the clicked <a> element
  ev.preventDefault() ;
  var linkToTheNextPage = link.href ;
  doSomething(linkToTheNextPage) ;
}

The onclick handler runs and the link is not followed. If I were you, I would use an event listener to do this.

Extracting a URL from a JavaScript function