How to split a string by multiple separators - and know which separator matched

Date: 2014-11-13 12:08:41

Tags: java regex algorithm string-parsing string-split

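With String.split it is easy to split a string by multiple separators. You only need to define a regular expression that matches every separator you want to use. For example,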

"1.22-3".split("[.-]")

yields the elements "1", "22" and "3". So far so good.
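For reference, here is a minimal runnable sketch of that call. The class and method names (SplitDemo, splitSegments) are made up for illustration. It shows that the result contains only the segments, so the information about which separator matched is gone.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on '.' or '-'. Inside a character class '.' is literal,
    // and a '-' placed last in the class is literal as well.
    static String[] splitSegments(String input) {
        return input.split("[.-]");
    }

    public static void main(String[] args) {
        // The separators themselves are not part of the result.
        System.out.println(Arrays.toString(splitSegments("1.22-3"))); // prints [1, 22, 3]
    }
}
```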

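However, now I also need to know which separator was found between the segments. Is there a straightforward way to achieve this?

I looked at String.split, at its legacy predecessor StringTokenizer, and at other supposedly more modern libraries (for example StrTokenizer from Apache Commons), but with none of them could I get hold of the matched separator.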

2 answers:

Answer 0 (score: 2)

If you retrace what String.split(regex) does and record the information that String.split throws away, it is quite simple:

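import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String source = "1.22-3";
Matcher m = Pattern.compile("[.-]").matcher(source);
List<String> elements = new ArrayList<>();
List<String> separators = new ArrayList<>();
int pos;
// Each match is one separator; the text between two matches is an element.
for (pos = 0; m.find(); pos = m.end()) {
    elements.add(source.substring(pos, m.start()));
    separators.add(m.group());
}
// Text after the last separator (the whole string if nothing matched).
elements.add(source.substring(pos));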

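At the end of this code, separators.get(x) yields the separator between elements.get(x) and elements.get(x+1). Note that separators always has one item fewer than elements.

If you want the elements and the separators in a single list, just change the code so that the two lists are the same list. The items are already added in order of occurrence.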
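A minimal sketch of that single-list variant might look like the following. The names SplitWithSeparators and splitKeepingSeparators are hypothetical, and the sketch assumes the separator regex never matches the empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitWithSeparators {
    // Returns elements and separators interleaved in order of occurrence:
    // elements sit at even indices, separators at odd indices.
    // Assumption: separatorRegex never matches the empty string.
    static List<String> splitKeepingSeparators(String source, String separatorRegex) {
        Matcher m = Pattern.compile(separatorRegex).matcher(source);
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (m.find()) {
            parts.add(source.substring(pos, m.start())); // element before the separator
            parts.add(m.group());                        // the separator that matched
            pos = m.end();
        }
        parts.add(source.substring(pos)); // element after the last separator
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepingSeparators("1.22-3", "[.-]")); // prints [1, ., 22, -, 3]
    }
}
```

Unlike String.split, this keeps empty elements everywhere, so two adjacent separators produce an empty string between them.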

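Answer 1 (score: 0)

I think I was looking for the wrong algorithm to achieve my goal. Instead of a split-by-separator method, the following two-step approach turned out to be more successful:

  • First, I implemented a lexer (aka tokenizer, scanner) that splits the string into tokens, including the separators. That is, it splits 1.22-3 into 1, ., 22, -, 3.

  • Then, I implemented a parser that processes this token stream, i.e. that distinguishes the segments from their separators.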


A possible implementation of the lexer:

import java.util.ArrayList;
import java.util.List;

public final class FixedStringTokenScanner {

    /**
     * Splits the given input into tokens. Each token is either one of the given constant string
     * tokens or a string consisting of the other characters between the constant tokens.
     *
     * @param input
     *            The string to split.
     * @param fixedStringTokens
     *            A list of strings to be recognized as separate tokens.
     * @return A list of strings, which when concatenated would result in the input string.
     *         Occurrences of the fixed string tokens in the input string are returned as separate
     *         list entries. These entries are reference-equal to the respective fixedStringTokens
     *         entry. Characters which did not match any of the fixed string tokens are concatenated
     *         and returned as list entries at the respective positions in the list. The list does
     *         not contain empty or <code>null</code> entries.
     */
    public static List<String> splitToFixedStringTokensAndOtherTokens(final String input, final String... fixedStringTokens) {
        return new FixedStringTokenScannerRun(input, fixedStringTokens).splitToFixedStringAndOtherTokens();
    }

    private static class FixedStringTokenScannerRun {

        private final String input;
        private final String[] fixedStringTokens;

        private int scanIx = 0;
        private final StringBuilder otherContent = new StringBuilder();
        private final List<String> result = new ArrayList<String>();

        public FixedStringTokenScannerRun(final String input, final String[] fixedStringTokens) {
            this.input = input;
            this.fixedStringTokens = fixedStringTokens;
        }

        List<String> splitToFixedStringAndOtherTokens() {
            while (scanIx < input.length()) {
                scanIx += matchFixedStringOrAppendToOther();
            }
            storeOtherTokenIfNotEmpty();
            return result;
        }

        /**
         * @return the number of matched characters.
         */
        private int matchFixedStringOrAppendToOther() {
            for (String fixedString : fixedStringTokens) {
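                // Skip empty tokens: they would match with length 0 and the scan would never advance.
                if (!fixedString.isEmpty() && input.regionMatches(scanIx, fixedString, 0, fixedString.length())) {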
                    storeOtherTokenIfNotEmpty();
                    result.add(fixedString); // add string instance so that identity comparison works
                    return fixedString.length();
                }
            }
            appendCharacterToOther();
            return 1;
        }

        private void appendCharacterToOther() {
            otherContent.append(input.charAt(scanIx));
        }

        private void storeOtherTokenIfNotEmpty() {
            if (otherContent.length() > 0) {
                result.add(otherContent.toString());
                otherContent.setLength(0);
            }
        }
    }
}
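
The answer does not show the second step, the parser. As a rough sketch only, it could pair each segment with the separator that follows it. Everything here (SegmentParser, Segment, parse) is a hypothetical name, the token list is hard-coded to match what the lexer above would return for 1.22-3, and separators are recognized by value (Set.contains) rather than by the reference equality the lexer's Javadoc mentions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SegmentParser {
    /** A segment plus the separator that follows it (null after the last segment). */
    static final class Segment {
        final String text;
        final String separator;

        Segment(String text, String separator) {
            this.text = text;
            this.separator = separator;
        }

        @Override
        public String toString() {
            return text + "|" + separator;
        }
    }

    // Walks the token stream; a separator token closes the current segment.
    // Two adjacent separators produce a segment with empty text.
    static List<Segment> parse(List<String> tokens, Set<String> separators) {
        List<Segment> segments = new ArrayList<>();
        String current = "";
        for (String token : tokens) {
            if (separators.contains(token)) {
                segments.add(new Segment(current, token));
                current = "";
            } else {
                current = token;
            }
        }
        segments.add(new Segment(current, null)); // trailing segment has no separator
        return segments;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("1", ".", "22", "-", "3");
        Set<String> separators = new HashSet<>(Arrays.asList(".", "-"));
        System.out.println(parse(tokens, separators)); // prints [1|., 22|-, 3|null]
    }
}
```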