[Solved] Extracting the text after a tag using Jsoup

Time: 2019-04-13 05:17:04

Tags: java jsoup

The code below gives me this output:

<a href="https://timesofindia.indiatimes.com/india/uk-envoy-lays-wreath-at-jallianwala-bagh-memorial-expresses-deep-regret/articleshow/68860078.cms"><img border="0" hspace="10" align="left" style="margin-top:3px;margin-right:5px;" src="https://timesofindia.indiatimes.com/photo/68860078.cms" /></a>British High Commissioner to India Sir Dominic Asquith laid a wreath at the Jallianwala Bagh memorial here on Saturday on the centenary of the massacre and said Britain "deeply regretted" the suffering caused to the victims.

I am trying to extract the text that comes after the closing </a> tag.

Here is my code. Is there a method in jsoup that can do this, or am I missing something else?

try {
    Document document = Jsoup.connect("https://timesofindia.indiatimes.com/rssfeeds/-2128936835.cms").parser(Parser.xmlParser()).get();
    Elements items = document.getElementsByTag("item");
    for (Element element : items) {
        String title = element.select("title").text();
        String link = element.select("link").text();
        String time = element.select("pubDate").text();
        String description = element.select("description").text();
        System.out.println(description);
    }
} catch (IOException ex) {
    Logger.getLogger(TimesOfIndia.class.getName()).log(Level.SEVERE, null, ex);
}

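Expected output: British High Commissioner to India Sir Dominic Asquith laid a wreath at the Jallianwala Bagh memorial here on Saturday on the centenary of the massacre and said Britain "deeply regretted" the suffering caused to the victims.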

Actual output: <a href="https://timesofindia.indiatimes.com/india/uk-envoy-lays-wreath-at-jallianwala-bagh-memorial-expresses-deep-regret/articleshow/68860078.cms"><img border="0" hspace="10" align="left" style="margin-top:3px;margin-right:5px;" src="https://timesofindia.indiatimes.com/photo/68860078.cms" /></a>British High Commissioner to India Sir Dominic Asquith laid a wreath at the Jallianwala Bagh memorial here on Saturday on the centenary of the massacre and said Britain "deeply regretted" the suffering caused to the victims.

2 answers:

Answer 0: (score: 1)

Element has a nextSibling() method, which should work:

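// Parse the description as HTML first; select() returns Elements, so take first() before calling nextSibling()
String text = ((TextNode) Jsoup.parse(element.select("description").text()).select("a").first().nextSibling()).text();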
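
The nextSibling() idea can be sketched as a small standalone program. This is only an illustration under some assumptions: the description string has already been read from the feed, the class name TextAfterAnchor and the helper textAfterFirstAnchor are hypothetical, and the sample URLs are placeholders. The description is parsed as an HTML fragment, and the text node that directly follows the first <a> is returned.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class TextAfterAnchor {

    // Hypothetical helper: returns the text node that directly follows the first <a>.
    // Falls back to the whole body text when the fragment contains no <a>.
    static String textAfterFirstAnchor(String descriptionHtml) {
        Document doc = Jsoup.parse(descriptionHtml);
        Element anchor = doc.select("a").first();
        if (anchor == null) {
            return doc.body().text();
        }
        Node next = anchor.nextSibling();
        // nextSibling() returns a Node; only a TextNode carries plain text
        return (next instanceof TextNode) ? ((TextNode) next).text().trim() : "";
    }

    public static void main(String[] args) {
        // Placeholder markup shaped like the feed's description field
        String sample = "<a href=\"https://example.com/story\">"
                + "<img src=\"https://example.com/photo.png\" /></a>Text after the link.";
        System.out.println(textAfterFirstAnchor(sample));
    }
}
```

Inside the loop from the question, the helper would be called as textAfterFirstAnchor(description). The null check and the instanceof check guard against items whose description has no image link.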

Answer 1: (score: 0)

I solved the problem with my own workaround. Here is the code.

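Solution: What does this code do? I create a new Document object from the description, remove the <a> tags from it, and then simply print the remaining text. It is not the best approach, but it works.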

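// Parse the description's HTML into its own Document
d = Jsoup.parse(desc);
// Select every <a> element and remove it, together with the <img> inside it
Elements a = d.select("a");
a.remove();
// Only the plain text is left in the body
System.out.println(d.body().text());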

Full code

try {
    Document d;
    Document document = Jsoup.connect("https://timesofindia.indiatimes.com/rssfeeds/-2128936835.cms").parser(Parser.xmlParser()).get();
    Elements items = document.getElementsByTag("item");
    for (Element element : items) {
        String title = element.select("title").text();
        String link = element.select("link").text();
        String time = element.select("pubDate").text();
        String desc = element.select("description").text();
        d = Jsoup.parse(desc);
        Elements a = d.select("a");
        a.remove();
        System.out.println(d.body().text());
    }
} catch (IOException ex) {
    Logger.getLogger(TimesOfIndia.class.getName()).log(Level.SEVERE, null, ex);
}
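
As a possible alternative to removing the <a> element (a sketch, not part of either answer): Element.ownText() returns only the text held directly by an element and skips text inside its children, so the body's own text is exactly what follows the link. The class name DescriptionOwnText and the helper plainDescription are hypothetical, and the sample URLs are placeholders.

```java
import org.jsoup.Jsoup;

class DescriptionOwnText {

    // Hypothetical helper: parses the description fragment and returns the text
    // that sits directly in <body>, ignoring anything nested inside <a> or <img>.
    static String plainDescription(String descriptionHtml) {
        return Jsoup.parse(descriptionHtml).body().ownText();
    }

    public static void main(String[] args) {
        // Placeholder markup shaped like the feed's description field
        String sample = "<a href=\"https://example.com/story\">"
                + "<img src=\"https://example.com/photo.png\" /></a>Text after the link.";
        System.out.println(plainDescription(sample));
    }
}
```

Unlike the remove() workaround, this leaves the parsed document unchanged. It would, however, also drop text wrapped in other inline tags such as <b>, so it only fits descriptions shaped like the one in the question.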