First of all, I know there are similar questions, such as:
How to split a string, but also keep the delimiters?
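However, I have run into a problem splitting a string with Pattern.split() where the pattern is built from a list of delimiters that sometimes appear to overlap. Here is an example:
The goal is to split a string on a set of known codewords that are surrounded by slashes, keeping both the delimiter (the codeword) itself and the value after it (which may be an empty string).
For this example, the codewords are: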
/ABC/
/DEF/
/GHI/
Based on the thread referenced above, I built the pattern as follows, using look-ahead and look-behind to tokenize the string into codewords AND values:
((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|((?<=/GHI/)|(?=/GHI/))
Working string:
"123/ABC//DEF/456/GHI/789"
Using split, this tokenizes nicely into:
"123","/ABC/","/DEF/","456","/GHI/","789"
Problem string (note the single slash between "ABC" and "DEF"):
"123/ABC/DEF/456/GHI/789"
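Here the expectation is that "DEF/456" is the value after the "/ABC/" codeword, because the "DEF/" bit is not actually a codeword; it just happens to look like one!
Desired result: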
"123","/ABC/","DEF/456","/GHI/","789"
Actual result:
"123","/ABC","/","DEF/","456","/GHI/","789"
As you can see, the slash between "ABC" and "DEF" is being isolated as a token of its own.
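For reference, the behavior described above can be reproduced with a small sketch. The class name SplitRepro and the helpers buildPattern and tokenize are made up for illustration, and the codewords are assumed to contain no regex metacharacters:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SplitRepro {
    // Builds ((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|... from the codewords.
    // Assumption: codewords contain no regex metacharacters, so they are not quoted.
    static Pattern buildPattern(List<String> codewords) {
        String regex = codewords.stream()
            .map(cw -> "((?<=" + cw + ")|(?=" + cw + "))")
            .collect(Collectors.joining("|"));
        return Pattern.compile(regex);
    }

    // Splits the input at every position just before or just after a codeword.
    static List<String> tokenize(String input) {
        Pattern p = buildPattern(Arrays.asList("/ABC/", "/DEF/", "/GHI/"));
        return Arrays.asList(p.split(input));
    }

    public static void main(String[] args) {
        // Working string: prints [123, /ABC/, /DEF/, 456, /GHI/, 789]
        System.out.println(tokenize("123/ABC//DEF/456/GHI/789"));
        // Problem string: prints [123, /ABC, /, DEF/, 456, /GHI/, 789]
        System.out.println(tokenize("123/ABC/DEF/456/GHI/789"));
    }
}
```

The stray "/" token appears because the position before "/DEF/" and the position after "/ABC/" are both split points, and they sit one character apart.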
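I have tried the solutions from the other thread that use only look-ahead or only look-behind, but they all seem to suffer from the same problem. Any help is appreciated!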
Answer 0 (score: 2)
If you use find instead of split, with some non-greedy matching, try the following:
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SampleJava {
    static final String[] CODEWORDS = {
        "ABC",
        "DEF",
        "GHI"};

    static public void main(String[] args) {
        String input = "/ABC/DEF/456/GHI/789";
        String codewords = Arrays.stream(CODEWORDS)
            .collect(Collectors.joining("|", "/(", ")/"));
        // codewords = "/(ABC|DEF|GHI)/";
        Pattern p = Pattern.compile(
            /* codewords */ ("(DELIM)"
            /* pre-delim */ + "|(.+?(?=DELIM))"
            /* final bit */ + "|(.+?$)").replace("DELIM", codewords));
        Matcher m = p.matcher(input);
        while (m.find()) {
            System.out.print(m.group(0));
            // Group 1 is non-null only when the codeword alternative matched.
            if (m.group(1) != null) {
                System.out.print(" ← code word");
            }
            System.out.println();
        }
    }
}
Output:
/ABC/ ← code word
DEF/456
/GHI/ ← code word
789
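If a list of tokens is more convenient than printed output, the same find-based idea can be wrapped in a method. This is only a sketch: FindTokenizer and tokenize are hypothetical names, Pattern.quote is added as a precaution that the answer above does not use, and the input is assumed to be a single line (the "." in the pattern does not match line terminators).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class FindTokenizer {
    // Hypothetical helper: returns codewords and values in input order.
    // Codewords are passed without the surrounding slashes, e.g. "ABC".
    static List<String> tokenize(String input, String... codewords) {
        // Pattern.quote guards against regex metacharacters in a codeword.
        String delim = Arrays.stream(codewords)
            .map(Pattern::quote)
            .collect(Collectors.joining("|", "/(?:", ")/"));
        Pattern p = Pattern.compile(
            "(" + delim + ")"            // a codeword
            + "|.+?(?=" + delim + ")"    // shortest text up to the next codeword
            + "|.+?$");                  // whatever is left at the end
        List<String> tokens = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [123, /ABC/, DEF/456, /GHI/, 789]
        System.out.println(tokenize("123/ABC/DEF/456/GHI/789", "ABC", "DEF", "GHI"));
    }
}
```

Because find consumes "/ABC/" as one match before looking further, the trailing slash of "/ABC/" can no longer serve as the leading slash of a fake "/DEF/".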
Answer 1 (score: 1)
Use a combination of positive and negative lookarounds:
String[] parts = s.split("(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))");
Using an alternation inside a single look-ahead/look-behind also simplifies the pattern considerably.
See the live demo.
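To try the regex locally, here is a minimal sketch that runs it against both strings from the question. The class name LookaroundSplit and the tokenize helper are made up for illustration; the regex itself is copied unchanged. As far as I can tell, the "...." in the first negative look-behind relies on every codeword being three letters long (three letters plus the closing slash), so treat that as an assumption if your codewords vary in length.

```java
import java.util.Arrays;
import java.util.List;

public class LookaroundSplit {
    // The regex from the answer above, split across two lines for readability.
    static final String REGEX =
        "(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)"
        + "|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))";

    static List<String> tokenize(String s) {
        return Arrays.asList(s.split(REGEX));
    }

    public static void main(String[] args) {
        // prints [123, /ABC/, DEF/456, /GHI/, 789]
        System.out.println(tokenize("123/ABC/DEF/456/GHI/789"));
        // prints [123, /ABC/, /DEF/, 456, /GHI/, 789]
        System.out.println(tokenize("123/ABC//DEF/456/GHI/789"));
    }
}
```

The two negative look-behinds reject split points that belong to a fake codeword, that is, one whose leading slash is really the closing slash of the previous codeword.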
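Answer 2 (score: 0)
Following some TDD principles (Red-Green-Refactor), here is how I would implement this behavior:
I defined a set of unit tests that explain how I understood your "tokenization process". If any test is incorrect according to your expectations, feel free to tell me and I will edit my answer accordingly.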
import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;

import org.junit.Test;

public class TokenizerSpec {

    Tokenizer tokenizer = new Tokenizer("/ABC/", "/DEF/", "/GHI/");

    @Test
    public void itShouldTokenizeTwoConsecutiveCodewords() {
        String input = "123/ABC//DEF/456";
        List<String> tokens = tokenizer.splitPreservingCodewords(input);
        assertThat(tokens).containsExactly("123", "/ABC/", "/DEF/", "456");
    }

    @Test
    public void itShouldTokenizeMisleadingCodeword() {
        String input = "123/ABC/DEF/456/GHI/789";
        List<String> tokens = tokenizer.splitPreservingCodewords(input);
        assertThat(tokens).containsExactly("123", "/ABC/", "DEF/456", "/GHI/", "789");
    }

    @Test
    public void itShouldTokenizeWhenValueContainsSlash() {
        String input = "1/23/ABC/456";
        List<String> tokens = tokenizer.splitPreservingCodewords(input);
        assertThat(tokens).containsExactly("1/23", "/ABC/", "456");
    }

    @Test
    public void itShouldTokenizeWithoutCodewords() {
        String input = "123/456/789";
        List<String> tokens = tokenizer.splitPreservingCodewords(input);
        assertThat(tokens).containsExactly("123/456/789");
    }

    @Test
    public void itShouldTokenizeWhenEndingWithCodeword() {
        String input = "123/ABC/";
        List<String> tokens = tokenizer.splitPreservingCodewords(input);
        assertThat(tokens).containsExactly("123", "/ABC/");
    }

    @Test
    public void itShouldTokenizeWhenStartingWithCodeword() {
        String input = "/ABC/123";
        List<String> tokens = tokenizer.splitPreservingCodewords(input);
        assertThat(tokens).containsExactly("/ABC/", "123");
    }

    @Test
    public void itShouldTokenizeWhenOnlyCodeword() {
        String input = "/ABC//DEF//GHI/";
        List<String> tokens = tokenizer.splitPreservingCodewords(input);
        assertThat(tokens).containsExactly("/ABC/", "/DEF/", "/GHI/");
    }
}
This class makes all of the tests above pass:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public final class Tokenizer {

    private final List<String> codewords;

    public Tokenizer(String... codewords) {
        this.codewords = Arrays.asList(codewords);
    }

    public List<String> splitPreservingCodewords(String input) {
        List<String> tokens = new ArrayList<>();
        int lastIndex = 0; // start of the value that has not been emitted yet
        int i = 0;         // current scan position
        while (i < input.length()) {
            final int idx = i; // effectively final copy for use in the lambda
            // First codeword that starts exactly at the current position, if any.
            Optional<String> codeword = codewords.stream()
                .filter(cw -> input.substring(idx).indexOf(cw) == 0)
                .findFirst();
            if (codeword.isPresent()) {
                // Emit the pending value (if non-empty), then the codeword itself.
                if (i > lastIndex) {
                    tokens.add(input.substring(lastIndex, i));
                }
                tokens.add(codeword.get());
                // Skip the whole codeword so its slashes cannot start another match.
                i += codeword.get().length();
                lastIndex = i;
            } else {
                i++;
            }
        }
        // Emit whatever follows the last codeword.
        if (i > lastIndex) {
            tokens.add(input.substring(lastIndex, i));
        }
        return tokens;
    }
}
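This is not finished yet (I don't have enough time to spend on the answer right now). If you ask me to, I will gladly do some refactoring of the Tokenizer class (but later). :-) Or you can do it yourself quite safely, since you have the unit tests to guard against regressions.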
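As an illustration of where such a refactoring might go, here is one possible sketch. It is not the answer author's code: the class name RefactoredTokenizer and the helpers codewordAt and addIfNotEmpty are invented here, and the codewords are assumed to be non-empty strings. It swaps the substring(...).indexOf(...) == 0 check for String.startsWith(prefix, offset), which avoids creating a substring at every position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public final class RefactoredTokenizer {

    private final List<String> codewords;

    // Assumption: every codeword is non-empty, otherwise the scan would not advance.
    public RefactoredTokenizer(String... codewords) {
        this.codewords = Arrays.asList(codewords);
    }

    public List<String> splitPreservingCodewords(String input) {
        List<String> tokens = new ArrayList<>();
        int valueStart = 0; // start of the value that has not been emitted yet
        int i = 0;
        while (i < input.length()) {
            Optional<String> codeword = codewordAt(input, i);
            if (codeword.isPresent()) {
                addIfNotEmpty(tokens, input.substring(valueStart, i));
                tokens.add(codeword.get());
                i += codeword.get().length();
                valueStart = i;
            } else {
                i++;
            }
        }
        addIfNotEmpty(tokens, input.substring(valueStart));
        return tokens;
    }

    // First codeword that starts exactly at the given index, if any.
    private Optional<String> codewordAt(String input, int index) {
        return codewords.stream()
            .filter(cw -> input.startsWith(cw, index))
            .findFirst();
    }

    private static void addIfNotEmpty(List<String> tokens, String value) {
        if (!value.isEmpty()) {
            tokens.add(value);
        }
    }

    public static void main(String[] args) {
        RefactoredTokenizer t = new RefactoredTokenizer("/ABC/", "/DEF/", "/GHI/");
        // prints [123, /ABC/, DEF/456, /GHI/, 789]
        System.out.println(t.splitPreservingCodewords("123/ABC/DEF/456/GHI/789"));
    }
}
```

Pointing the TokenizerSpec tests at this class instead of Tokenizer would be the way to confirm that the behavior is unchanged.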