Question

I'm trying to do some research on the ASME Digital Collection, and I'm stuck. Consider the following link: http://mechanicaldesign.asmedigitalcollection.asme.org/article.aspx?articleid=1897362

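The link above takes you to one of the publications. The page lists the authors, and a superscript (1) next to a name marks that author as the corresponding author. I need to find out which author is the corresponding author. In the example above, it is "Julie S. Linsey". I have tried the following: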

doc.select("sup")
doc.select("div[id=scm6MainContent_lblAuthors] a.disclosureLink special")
doc.getElementsByAttributeValue("href", "#cor1") 
Elements elementsByClass2 = doc.getElementsByClass("disclosureLink special"); // and then iterating on them to check if I can retrieve <sup> element.

None of them seem to work.

Can you help?

Answer 1

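I noticed that if you don't send a user agent, the returned HTML does not contain scm6MainContent_lblAuthors.

The elements inside it are separated by span elements, so two consecutive "a" tags mean that the author has a superscript.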

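        // Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element and org.jsoup.select.Elements.
        // Send a browser-like user agent; without one the response lacks the authors element.
        Document doc = Jsoup.connect("http://mechanicaldesign.asmedigitalcollection.asme.org/article.aspx?articleid=1897362")
                .userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6").get();
        Elements all = doc.select("#scm6MainContent_lblAuthors");
        // first() returns null when nothing matched, so fall back to an empty list.
        Elements els = all.isEmpty() ? new Elements() : all.first().children();

        for (int i = 0; i + 1 < els.size(); i++) {
            Element el = els.get(i);
            // An "a" directly followed by another "a" is an author name followed by its superscript link.
            if ("a".equals(el.tagName()) && "a".equals(els.get(i + 1).tagName())) {
                System.out.println(el.text());
            }
        }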
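
The adjacency check can also be tried without network access or Jsoup. The following is only a minimal sketch of that idea: the Node class, the markedAuthors and sample helpers, and the author names are all hypothetical stand-ins, not the real page markup.

```java
import java.util.ArrayList;
import java.util.List;

class AdjacentLinkSketch {

    // Hypothetical stand-in for a child element: a tag name plus its text.
    static final class Node {
        final String tag;
        final String text;

        Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }
    }

    // Collects the text of every "a" node that is immediately followed by another "a" node.
    static List<String> markedAuthors(List<Node> children) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < children.size(); i++) {
            if ("a".equals(children.get(i).tag) && "a".equals(children.get(i + 1).tag)) {
                result.add(children.get(i).text);
            }
        }
        return result;
    }

    // Made-up child sequence: authors separated by spans, one author followed by a superscript link.
    static List<Node> sample() {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("a", "Author A"));
        nodes.add(new Node("span", ", "));
        nodes.add(new Node("a", "Author B"));
        nodes.add(new Node("a", "1"));
        nodes.add(new Node("span", ", "));
        nodes.add(new Node("a", "Author C"));
        return nodes;
    }
}
```

Calling AdjacentLinkSketch.markedAuthors(AdjacentLinkSketch.sample()) returns a list containing only "Author B". Note that the check relies purely on element order, so it would need adjusting if the page ever placed other links directly next to an author name.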

Jsoup: Cannot select some elements even though I can see them in the page's HTML source
