Java jsoup: removing new lines

Date: 2016-08-30 23:45:57

Tags: java jsoup

        for (int x = 0; x < 8000; x += 50) {
            Document doc = Jsoup.connect("http://localhost.com/" + x).get();
            Elements links = doc.select("a[href]");
            for (Element link: links) {
                String text = link.text();
                System.out.println(text);
            }

        }
    }
}

This produces output like the following:

Adrian Riven

HalfSugar No Ice

Yassuo

Amandadog

P1 Sloosh

Is there any way to remove the blank lines, so that the output looks like this?

Adrian Riven
HalfSugar No Ice
Yassuo
Amandadog
P1 Sloosh

I tried text.replace("\n", ""); and text.replaceAll("\r?\n", "");
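A note on why these two calls are unlikely to help, offered as a sketch rather than something taken from the original post. Java strings are immutable, so replace and replaceAll return a new string and leave text unchanged unless the result is assigned. In addition, jsoup's Element.text() returns whitespace-normalized text, so a link without visible text yields an empty string, and System.out.println then prints only its own line break. The class and method names below (ReplaceDemo, stripLineBreaks) are hypothetical and use only the standard library.

```java
public class ReplaceDemo {
    // Returns a copy of the text without line breaks; the caller must use the return value.
    public static String stripLineBreaks(String text) {
        return text.replaceAll("\\r?\\n", "");
    }

    public static void main(String[] args) {
        String text = "Adrian\nRiven";

        // The result is discarded here, so text still contains the line break.
        text.replace("\n", "");
        System.out.println(text.contains("\n")); // prints true

        // Using the returned value gives the stripped string.
        System.out.println(stripLineBreaks(text)); // prints AdrianRiven

        // An empty string has no line break to strip; println would still
        // emit one for it, which shows up as a blank line in the output.
        System.out.println(stripLineBreaks("").isEmpty()); // prints true
    }
}
```

If the blank lines do come from empty link texts, the fix is to skip those entries before printing rather than to edit the strings.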

Edit: I tried it like this, and it did not work for me. I have not tried the other one.

        Elements links = doc.select("a[href]");
        for (Element link: links) {
            Document docs = Jsoup.parse(String.valueOf(links));
            docs.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
            String text = link.text()+link.text();
            System.out.println(text.replace("Show More", ""));
        }

Sample HTML:

</td>
    <td class="SummonerName Cell">
        <a href="/summoner/userName=Cris" class="Link">Cris</a>
    </td>
                <td class="TierRank Cell">Challenger</td>
        <td class="LP Cell">1,137 LP</td>
            <td class="TeamName Cell">
                        Apex Gaming
                </td>
    <td class="RatioGraph Cell">
                        <div class="WinRatioGraph">
                <div class="Graph">

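2 answers:

Answer 0 (score: 0)

Removal can be tricky, because some HTML tags are always empty, such as <br/> and <img/>.

If you can decide which elements you are willing to remove, try the following: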

// Names of the elements to remove if empty
Set<String> removable = ....

// Parse the html into a jsoup document
Document source = Jsoup.parse(myHtml);

// Clean the html according to a whitelist
// (whitelist is not defined in this answer; Whitelist.relaxed() would be one choice)
Document cleaned = new Cleaner(whitelist).clean(source);

// For each element in the cleaned document
for (Element el : cleaned.getAllElements()) {
    if (el.children().isEmpty() && !el.hasText()) {
        // Element is empty, check if it should be removed
        if (removable.contains(el.tagName())) {
            el.remove();
        }
    }
}

Or change the OutputSettings:

final String html = ...;
OutputSettings settings = new OutputSettings();
settings.escapeMode(Entities.EscapeMode.xhtml);
String cleanHtml = Jsoup.clean(html, "", Whitelist.relaxed(), settings);

This can also be done on a document parsed by Jsoup:

Document doc = Jsoup.parse(...);
doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
// ...

Answer 1 (score: 0)

This trick worked for me:

Document doc = Jsoup.connect("http://localhost.com").get();

Elements links = doc.select("a[href]");
for (Element link : links) {
    // Skip links without visible text so that no blank line is printed
    if (!link.text().isEmpty()) {
        System.out.println(link.text());
    }
}
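
The same filter can be separated from the network call, which makes it easy to check without a server. The following is a minimal sketch using only the standard library; the class name NonBlankTexts, the method nonBlank, and the sample list are hypothetical, and the list stands in for the values that link.text() would return. Trimming before the emptiness check is an added assumption that whitespace-only entries should be dropped as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NonBlankTexts {
    // Keeps only the entries that contain at least one non-whitespace character.
    public static List<String> nonBlank(List<String> texts) {
        List<String> kept = new ArrayList<>();
        for (String t : texts) {
            if (!t.trim().isEmpty()) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the link texts; empty entries model links
        // that have no visible text.
        List<String> texts = Arrays.asList(
                "Adrian Riven", "", "HalfSugar No Ice", "", "Yassuo");

        // Prints the three names, one per line, with no blank lines between them.
        for (String t : nonBlank(texts)) {
            System.out.println(t);
        }
    }
}
```

With jsoup, the input list could be built by adding link.text() for each selected element before calling nonBlank.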