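I need to tokenize a String as follows:
1. Split by whitespace
2. Remove all non-letters
3. Remove all letter tokens of length less than N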
It looks like I can get #1 and #2 with:
String str = "blah blah";
String p1 = "[^a-zA-Z ]";
String p2 = "\\s+";
String[] tokens = str.replaceAll(p1, "").split(p2);
Can I modify p1 so that it also covers #3? As an alternative, I could do:
String p1 = "[^a-zA-Z ]";
String p2 = "\\s+";
String p3 = "\\b\\w{1,2}\\b";
String[] tokens = str.replaceAll(p1, "").replaceAll(p3, "").split(p2);
Is p3 correct?
I would also prefer to avoid adding another pattern (it would be less efficient too, right?)
Answer 0 (score: 0)
You can combine #2 and #3 into:
str = str.replaceAll("\\b[^a-zA-Z]*(?:[a-zA-Z][^a-zA-Z ]*){1,2}\\b|[^a-zA-Z ]+", "");
This removes all non-letter, non-space characters and all words shorter than 3 characters.
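To see what the combined pattern does on a concrete string, here is a minimal sketch. The class name `CombinedPatternDemo`, the helper `strip`, and the sample input are made up for illustration; only the pattern itself comes from the answer above.

```java
import java.util.Arrays;

public class CombinedPatternDemo {
    // Pattern copied from the answer: matches words of 1-2 letters
    // (plus any non-letters attached to them) or runs of non-letter, non-space characters.
    static final String COMBINED =
        "\\b[^a-zA-Z]*(?:[a-zA-Z][^a-zA-Z ]*){1,2}\\b|[^a-zA-Z ]+";

    // Hypothetical helper: applies the combined pattern once.
    static String strip(String s) {
        return s.replaceAll(COMBINED, "");
    }

    public static void main(String[] args) {
        String cleaned = strip("a bb ccc dddd");
        // Brackets make leading and trailing whitespace visible.
        System.out.println("[" + cleaned + "]");
        System.out.println(Arrays.toString(cleaned.split("\\s+")));
    }
}
```

On this input the short words are removed, but the cleaned string still starts with a space, so the subsequent `split("\\s+")` returns an empty first token. Whether that matters depends on how the tokens are consumed.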
Answer 1 (score: 0)
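No, p3 is not correct: you eliminate some whitespace before splitting, you don't account for leading whitespace causing split() to return an empty leading value, and you hard-coded N.
Test with this input string:
String input = " A Aa AaA AaAa \r\n" +
               " 1 11 111 1111 \r\n" +
               " A1 A1a A1c1 A1a1A A1c1A1 A1a1A1a \r\n" +
               " AeA\tAeAeA ";
Below are 4 implementations, with my solution listed last. When called with n = 3, they produce the following output: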
[AaA, AaAa, AaA, AcA, AaAa, AeA, AeAeA] ← literalInterpretation
[, AaA, AaAa, AaA, AcA, AaAa, AeAAeAeA] ← fromQuestion
[, AaA, AaAa, AaA, AcA, AaAa, AeAAeAeA] ← answerByAnubhava
[AaA, AaAa, AaA, AcA, AaAa, AeA, AeAeA] ← answerByAnubhavaFixedByMe
[AaA, AaAa, AaA, AcA, AaAa, AeA, AeAeA] ← myAnswer
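The two differences visible in the output above (the empty leading element and the fused `AeAAeAeA` token) can be reproduced in isolation. The following is a minimal sketch; the class name `SplitPitfallsDemo` and its helper methods are hypothetical, and the `trim()` call at the end is just one possible way to avoid the empty leading token, not necessarily the approach used in the implementations listed here.

```java
import java.util.Arrays;

public class SplitPitfallsDemo {
    // Same split pattern as p2 in the question.
    static String[] splitOnWhitespace(String s) {
        return s.split("\\s+");
    }

    // Same removal pattern as p1 in the question: keeps only ASCII letters and spaces.
    static String removeNonLetters(String s) {
        return s.replaceAll("[^a-zA-Z ]", "");
    }

    public static void main(String[] args) {
        // Leading whitespace produces an empty first element;
        // trailing empty strings are dropped by split().
        System.out.println(Arrays.toString(splitOnWhitespace(" a b ")));        // [, a, b]

        // A tab is neither a letter nor a space, so p1 deletes it
        // and the two neighbouring words are fused before the split.
        System.out.println(removeNonLetters("AeA\tAeAeA"));                     // AeAAeAeA

        // One possible workaround (an assumption, not taken from the answers):
        // trim before splitting.
        System.out.println(Arrays.toString(splitOnWhitespace(" a b ".trim()))); // [a, b]
    }
}
```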
private static void literalInterpretation(int n, String input) {
    // 1. Split by whitespace
    String[] values = input.split("(?U)\\s+"); // Whitespaces (unicode character class)
    // 2. Remove all non-letters
    for (int i = 0; i < values.length; i++)
        values[i] = values[i].replaceAll("\\P{L}+", ""); // Non-letters (unicode category)
    // 3. Remove all letter tokens of length less than N
    List<String> tokens = new ArrayList<>();
    for (String value : values)
        if (value.length() >= n)
            tokens.add(value);
    System.out.println(tokens);
}

private static void fromQuestion(int n, String input) {
    String p1 = "[^a-zA-Z ]";
    String p2 = "\\s+";
    String p3 = "\\b\\w{1," + (n-1) + "}\\b";
    String[] tokens = input.replaceAll(p1, "").replaceAll(p3, "").split(p2);
    System.out.println(Arrays.toString(tokens));
}

private static void answerByAnubhava(int n, String input) {
    String str = input.replaceAll("\\b(?:[a-zA-Z][^a-zA-Z ]*){1," + (n-1) + "}\\b|[^a-zA-Z ]+", "");
    String[] tokens = str.split("\\s+");
    System.out.println(Arrays.toString(tokens));
}

private static void answerByAnubhavaFixedByMe(int n, String input) {
    String[] tokens = input.replaceAll("(?U)\\b[^\\p{L}\\s]*(?:\\p{L}[^\\p{L}\\s]*){1," + (n-1) + "}\\b|[^\\p{L}\\s]+", "")
                           .replaceFirst("(?U)^\\s+", "")
                           .split("(?U)\\s+");
    System.out.println(Arrays.toString(tokens));
}