Splitting a string before each capital letter

Time: 2014-09-30 17:26:18

Tags: java string split jsoup

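I'm scraping lyrics from a lyrics website with Jsoup, and I want to format them the way they appear on the site. Right now the output comes back as one long string on a single line, like the text below. What I want is to break the string before each capital letter, so that every line starts where it does in the lyrics on the website.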

I was told a million times Of all the troubles in my way How I had to keep on trying Little better ev'ry day But if I crossed a million rivers And I rode a million miles Then I'd still be where I started Bread and butter for a smile Well I sold a million mirrors In a shop in Alley Way But I never saw my face In any window any day Well they say your folks are telling you To be a super star But I tell you just be satisfied To stay right where you are Keep yourself alive keep yourself alive It'll take you all your time and a money Honey you'll survive Well I've loved a million women In a belladonic haze And I ate a million dinners Brought to me on silver trays Give me ev'rything I need To feed my body and my soul And I'll grow a little bigger Maybe that can be my goal I was told a million times Of all the people in my way How I had to keep on trying And get better ev'ry day But if I crossed a million rivers And I rode a million miles Then I'd still be where I started Still be where I started Keep yourself alive keep yourself alive It'll take you all your time and money honey You'll survive Keep yourself alive Keep yourself alive It'll take you all your time and money To keep me satisfied Do you think you're better ev'ry day No I just think I'm two steps nearer to my grave Keep yourself alive Keep yourself alive mm You take your time and take your money Keep yourself alive Keep yourself alive Keep yourself alive All you people keep yourself alive Keep yourself alive Keep yourself alive It'll take you all your time and a money To keep me satisfied Keep yourself alive Keep yourself alive All you people keep yourself alive Take you all your time and money honey You will survive Keep you satisfied Keep you satisfied

I want it formatted like this: http://prntscr.com/4rt1cf

My code so far:

// Needs: java.io.IOException, java.util.Scanner, org.jsoup.Jsoup,
// org.jsoup.nodes.Document, org.jsoup.select.Elements
public static void lyricScrape() throws IOException {

    Scanner search = new Scanner(System.in);
    String artist;
    String song;
    Document doc;

    // The URL path uses the lowercased name with spaces removed.
    artist = search.nextLine();
    artist = artist.toLowerCase();
    artist = artist.replaceAll(" ", "");
    System.out.println("Artist saved");

    song = search.nextLine();
    song = song.toLowerCase();
    song = song.replaceAll(" ", "");
    System.out.println("Song saved");

    doc = Jsoup.connect("http://www.azlyrics.com/lyrics/" + artist + "/" + song + ".html").get();
    Elements element = doc.select("div[style^=margin]");
    // text() turns each <br> into a space, so the lyrics end up on one line.
    String lyrics = element.text();
    System.out.println(lyrics);
}

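2 answers:

Answer 0 (score: 2)

String.split takes a regular expression. The regex for a capital letter is "[A-Z]", but you want to keep that character, so match the space in front of it instead: "\\ [A-Z]". Finally, wrap the letter in a lookahead so that it is checked but not consumed by the match: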

String[] lines = lyrics.split("\\ (?=[A-Z])");
String formatted = lyrics.replaceAll("\\ (?=[A-Z])", "\n");
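
As a minimal, self-contained sketch of the lookahead split above (the class name, method name, and sample string are made up for illustration; a plain space is used in place of the escaped one, which matches the same thing):

```java
public class SplitBeforeCapitals {

    // Splits at every space that is immediately followed by an uppercase
    // ASCII letter. The lookahead (?=[A-Z]) only inspects the letter, so the
    // letter stays at the start of the next piece.
    static String[] splitBeforeCapitals(String text) {
        return text.split(" (?=[A-Z])");
    }

    public static void main(String[] args) {
        // Hypothetical sample, shortened from the lyrics in the question.
        String sample = "Keep yourself alive Keep yourself alive You will survive";
        for (String line : splitBeforeCapitals(sample)) {
            System.out.println(line);
        }
    }
}
```

Note that this is a heuristic: any capitalized word in the middle of a line (a name, or the pronoun I) also starts a new line.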

To allow for the single-letter word I, you can add a negative lookahead:

String[] lines = lyrics.split("\\ (?!I\\s)(?=[A-Z])");
String formatted = lyrics.replaceAll("\\ (?!I\\s)(?=[A-Z])", "\n");
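
The sketch below runs that pattern on a fragment of the lyrics. Because (?!I\\s) only skips an I that is followed by whitespace, contractions such as I'd still trigger a split. The second method is not part of this answer; it is a hypothetical variant that uses a word boundary (I\\b) so that I'd, I'll and I've are skipped as well. Class and method names are made up for illustration.

```java
public class SplitExceptPronounI {

    // Pattern from the answer: do not split before "I" followed by whitespace.
    static String[] splitSkippingI(String text) {
        return text.split(" (?!I\\s)(?=[A-Z])");
    }

    // Assumed variant: \b also matches between "I" and an apostrophe,
    // so "I'd" and similar contractions are not split off either.
    static String[] splitSkippingIAndContractions(String text) {
        return text.split(" (?!I\\b)(?=[A-Z])");
    }

    public static void main(String[] args) {
        String sample = "But if I crossed a million rivers And I rode a million miles "
                + "Then I'd still be where I started";
        System.out.println(String.join("\n", splitSkippingI(sample)));
        System.out.println("---");
        System.out.println(String.join("\n", splitSkippingIAndContractions(sample)));
    }
}
```

With either pattern, a line that genuinely begins with I (such as "I was told a million times") is not split off from the previous line, so the result is still an approximation of the original layout.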

Answer 1 (score: 0)

Based on an answer to "How do I preserve line breaks when using jsoup to convert html to plain text?":

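How about adding some special text after each <br/> in the HTML? That way, when you call text(), instead of line<br/>line you get something like line[specialString]line, and you can then replace [specialString] with \n. I mean: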

element.select("br").append("@REPLACEME@");
String lyrics = element.text().replaceAll("\\s*@REPLACEME@\\s*", "\n");
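
To show just the replacement step without a network call or the Jsoup dependency, here is a small sketch. The input string is a hand-written stand-in for what element.text() might return once the marker has been appended to each br; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class PlaceholderToNewline {

    static final String MARKER = "@REPLACEME@";

    // Replaces each marker, plus any whitespace around it, with one newline.
    // Pattern.quote guards against markers containing regex metacharacters.
    static String restoreLineBreaks(String flattened) {
        return flattened.replaceAll("\\s*" + Pattern.quote(MARKER) + "\\s*", "\n");
    }

    public static void main(String[] args) {
        // Assumed shape of the flattened text; the real output may differ in spacing.
        String flattened = "I was told a million times @REPLACEME@ "
                + "Of all the troubles in my way @REPLACEME@";
        System.out.print(restoreLineBreaks(flattened));
    }
}
```

Unlike the capital-letter heuristic, this keeps the line breaks exactly where the page put them, so words like I or capitalized names in the middle of a line cause no stray splits.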

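You can also run Jsoup.clean on the HTML of the lyrics to remove all unwanted markup, such as <b>, <i> and <!-- comments -->, keeping only the tags you allow, in this case <br />. Then replace each br tag with either \n or "", depending on whether your HTML already has an actual line break after <br/>. So your code would look like: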

String lyrics = Jsoup.clean(
                    element.html(), //html to clean
                    Whitelist.none().addTags("br")//allowed tags
                ).replace("<br /> ", "");