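How do I truncate a string after n words in Java?

Time: 2013-04-11 17:43:04

Tags: java string

Is there a library with a routine for truncating a string after n words? I'm looking for something that can turn: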

truncateAfterWords(3, "hello, this\nis a long sentence");

into

"hello, this\nis"

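I could write this myself, but I assume something like it already exists in an open-source string manipulation library.


Here is the complete list of test cases that I would like any solution to pass: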

import java.util.regex.*;

public class Test {

    private static final TestCase[] TEST_CASES = new TestCase[]{
        new TestCase(5, null, null),
        new TestCase(5, "", ""),
        new TestCase(5, "single", "single"),
        new TestCase(1, "single", "single"),
        new TestCase(0, "single", ""),
        new TestCase(2, "two words", "two words"),
        new TestCase(1, "two words", "two"),
        new TestCase(0, "two words", ""),
        new TestCase(2, "line\nbreak", "line\nbreak"),
        new TestCase(1, "line\nbreak", "line"),
        new TestCase(2, "multiple  spaces", "multiple  spaces"),
        new TestCase(1, "multiple  spaces", "multiple"),
        new TestCase(3, " starts with space", " starts with space"),
        new TestCase(2, " starts with space", " starts with"),
        new TestCase(10, "A full sentence, with puncutation.", "A full sentence, with puncutation."),
        new TestCase(4, "A full sentence, with puncutation.", "A full sentence, with"),
        new TestCase(50, "Testing a very long number of words in the testcase to see if the solution performs well in such a situation.  Some solutions don't do well with lots of input.", "Testing a very long number of words in the testcase to see if the solution performs well in such a situation.  Some solutions don't do well with lots of input."),
    };

    public static void main(String[] args){
        for (TestCase t: TEST_CASES){
            try {
                String r = truncateAfterWords(t.n, t.s);
                if (!t.equals(r)){
                    System.out.println(t.toString(r));
                }
            } catch (Exception x){
                System.out.println(t.toString(x));
            }       
        }   
    }

    public static String truncateAfterWords(int n, String s) {
        // TODO: implementation
        return null;
    }
}


class TestCase {
    public int n;
    public String s;
    public String e;

    public TestCase(int n, String s, String e){
        this.n=n;
        this.s=s;
        this.e=e;
    }

    public String toString(){
        return "truncateAfterWords(" + n + ", " + toJavaString(s) + ")\n  expected: " + toJavaString(e);
    }

    public String toString(String r){
        return this + "\n  actual:   " + toJavaString(r) + "";
    }

    public String toString(Exception x){
        return this + "\n  exception: " + x.getMessage();
    }    

    public boolean equals(String r){
        if (e == null && r == null) return true;
        if (e == null) return false;
        return e.equals(r);
    }   

    public static final String escape(String s){
        if (s == null) return null;
        s = s.replaceAll("\\\\","\\\\\\\\");
        s = s.replaceAll("\n","\\\\n");
        s = s.replaceAll("\r","\\\\r");
        s = s.replaceAll("\"","\\\\\"");
        return s;
    }

    private static String toJavaString(String s){
        if (s == null) return "null";
        return " \"" + escape(s) + "\"";
    }
}

There are solutions for other languages elsewhere on this site.

4 answers:

Answer 0 (score: 4)

You can use a simple regex-based solution:

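private String truncateAfterWords(int n, String str) {
   if (str == null) return null;
   // (?s) lets '.' match line breaks, so text after a newline is dropped as well
   return str.replaceAll("(?s)^((?:\\W*\\w+){" + n + "}).*$", "$1");
}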

Live demo: http://ideone.com/Nsojc7
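One detail of this pattern is worth checking: by default `.` does not match line terminators, so whether the text after a newline gets removed depends on the DOTALL flag `(?s)`. The sketch below compares the two variants. The class name and the `flags` parameter are illustrative assumptions, not part of the answer.

```java
class RegexTruncateDemo {
    // The one-line regex approach, with an optional embedded flag prefix such as "(?s)".
    static String truncate(String flags, int n, String s) {
        return s.replaceAll(flags + "^((?:\\W*\\w+){" + n + "}).*$", "$1");
    }

    public static void main(String[] args) {
        String s = "line\nbreak";
        // Without DOTALL the pattern cannot consume "\nbreak", so nothing is replaced.
        System.out.println(truncate("", 1, s).equals(s));
        // With DOTALL the remainder after the first word is removed.
        System.out.println(truncate("(?s)", 1, s).equals("line"));
    }
}
```

Inputs where the line break falls inside the kept part, such as the example in the question with n = 3, behave the same either way.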

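Update: to address the performance concern raised in your comment:

The following approach performs better when the input contains a large number of words: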

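// Zero-width match at every position where a word ends
private final static Pattern WB_PATTERN = Pattern.compile("(?<=\\w)\\b");

private String truncateAfterWords(int n, String s) {
   if (s == null) return null;
   if (n <= 0) return "";
   Matcher m = WB_PATTERN.matcher(s);
   // Advance to the end of the n-th word, or stop early if the input runs out
   for (int i=0; i<n && m.find(); i++);
   // hitEnd() is true when the search reached the end of the input: nothing to cut
   if (m.hitEnd())
      return s;
   else
      return s.substring(0, m.end());
}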
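To make the role of the pattern concrete, here is a small sketch that lists the positions it matches. The class and the `wordEnds` helper are hypothetical names used only for illustration; the pattern itself is the one from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordEndDemo {
    private static final Pattern WB_PATTERN = Pattern.compile("(?<=\\w)\\b");

    // Hypothetical helper: collects the index just past the end of every word in s.
    static List<Integer> wordEnds(String s) {
        List<Integer> ends = new ArrayList<>();
        Matcher m = WB_PATTERN.matcher(s);
        while (m.find()) {
            ends.add(m.end()); // matches are zero-width, so start() == end()
        }
        return ends;
    }

    public static void main(String[] args) {
        System.out.println(wordEnds("hello, this\nis a long sentence"));
    }
}
```

Because each match is a single position, truncating after n words only needs the n-th entry of this list, which is what the loop over `m.find()` computes without building the list.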

Answer 1 (score: 2)

I found a way to do it using the java.text.BreakIterator class:

private static String truncateAfterWords(int n, String s) {
    if (s == null) return null;
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(s);
    int pos = 0;
    for (int i = 0; i < n && pos != BreakIterator.DONE && pos < s.length();) {
        if (Character.isLetter(s.codePointAt(pos))) i++;
        pos = wb.next();
    }
    if (pos == BreakIterator.DONE || pos >= s.length()) return s;
    return s.substring(0, pos);
}
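The `Character.isLetter` check is there because the word iterator reports boundaries around whitespace and punctuation too, not only around words. A minimal sketch that makes this visible follows. The class and helper names are assumptions for illustration, and `Locale.ROOT` is passed only to keep the result independent of the default locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorDemo {
    // Hypothetical helper: returns every segment the word iterator reports for s.
    static List<String> segments(String s) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
        wb.setText(s);
        List<String> out = new ArrayList<>();
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            out.add(s.substring(start, end));
        }
        return out;
    }

    // Counts only the segments that begin with a letter, mirroring the isLetter check.
    static int countWords(String s) {
        int count = 0;
        for (String seg : segments(s)) {
            if (Character.isLetter(seg.codePointAt(0))) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "hello, this\nis";
        System.out.println(segments(s).size() + " segments, " + countWords(s) + " words");
    }
}
```

Note that a token starting with a digit is not counted as a word by this check; `Character.isLetterOrDigit` would be one way to change that, if such tokens should count.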

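Answer 2 (score: 0)

Here is a version that uses a regular expression in a loop to find the next run of whitespace until it has seen enough words. It is similar to the BreakIterator solution, but iterates over word separators with a regex instead.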

// Any number of white space or the end of the input
private final static Pattern SPACES_PATTERN = Pattern.compile("\\s+|\\z");

private static String truncateAfterWords(int n, String s) {
    if (s == null) return null;
    Matcher matcher = SPACES_PATTERN.matcher(s);
    int matchStartIndex = 0, matchEndIndex = 0, wordsFound = 0;
    // Keep matching until enough words are found, 
    // reached the end of the string, 
    // or no more matches
    while (wordsFound<n && matchEndIndex<s.length() && matcher.find(matchEndIndex)){
        // Keep track of both the start and end of each match
        matchStartIndex = matcher.start();
        matchEndIndex = matchStartIndex + matcher.group().length();
        // Only increment words found when not at the beginning of the string
        if (matchStartIndex != 0) wordsFound++;
    }
    // From the beginning of the string to the start of the final match
    return s.substring(0, matchStartIndex);
}
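The same idea of walking from one whitespace run to the next can also be written without any regex. This is a different technique from the answer above, shown only as a sketch; the class name is hypothetical, and a "word" is assumed to be any run of non-whitespace characters.

```java
class WordTruncate {
    // Keeps the first n whitespace-separated words of s, scanning one character at a time.
    static String truncateAfterWords(int n, String s) {
        if (s == null) return null;
        if (n <= 0) return "";
        int words = 0;
        boolean inWord = false;
        for (int i = 0; i < s.length(); i++) {
            boolean space = Character.isWhitespace(s.charAt(i));
            if (!space && !inWord) {
                inWord = true;   // a new word starts at index i
                words++;
            } else if (space && inWord) {
                inWord = false;  // the current word ended just before index i
                if (words == n) return s.substring(0, i);
            }
        }
        return s; // the scan ended without cutting, so the input is returned unchanged
    }

    public static void main(String[] args) {
        System.out.println(truncateAfterWords(3, "hello, this\nis a long sentence"));
    }
}
```

One design choice to be aware of: if the n-th word is followed only by trailing whitespace, this sketch still cuts that whitespace off. The test cases in the question do not cover that situation.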

Answer 3 (score: -1)

Try using a regular expression in Java. A regex that retrieves only the first n words is: (.*?\s){n}

Try this code:

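String inputStr = "hello, this\nis a long sentence";
// Lazily match three runs of "anything followed by one whitespace character"
Pattern pattern = Pattern.compile("(.*?[\\s]){3}", Pattern.DOTALL);
Matcher matcher = pattern.matcher(inputStr);
// find() fails when the input has fewer than three whitespace characters;
// calling group() after a failed find() would throw IllegalStateException
String result = matcher.find() ? matcher.group() : inputStr;
System.out.println(result); // the match keeps its trailing whitespace character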

Learn more about the package: