Question

How can I split the text below using the split criteria FIRST, NOW, THEN?

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";

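I expect 3 sentences:

FIRST i go to the homepage
NOW i click on button "NOW CLICK" very quick
THEN i will become a text result.

This code does not work properly because of the "NOW CLICK" button: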

String[] textArray = text.split("FIRST|NOW|THEN");

Answer 1

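If I understand you correctly, you

want to split the text on the keywords FIRST, NOW, THEN and keep them in the resulting parts,
but do not want to split when those keywords appear inside quotes.

If my guess is correct, then instead of the split method you can use find to iterate over all

quotes,
words that are not inside quotes,
whitespace.

This lets you append all quotes and whitespace to the result unchanged and focus only on the words that are not inside quotes, checking whether the text should be split on them.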

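A regex representing those parts could look like Pattern.compile("\"[^\"]*\"|\\S+|\\s+");

IMPORTANT: we need to search for the quote "..." first. Otherwise \\S+ would match "NOW CLICK" as "NOW and CLICK", two separate parts, which would prevent it from being seen as a single quote. That is why the "[^"]*" regex (which represents a quote) is placed at the start of the subregex1|subregex2|subregex3 alternation.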
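To see the effect of the alternation order, here is a minimal sketch. The class name TokenOrderDemo and the helper method tokens are hypothetical, chosen only for illustration; the two regexes differ only in where the quote alternative is placed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenOrderDemo {
    // Hypothetical helper: collects every match of the regex, in order of appearance.
    static List<String> tokens(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "button \"NOW CLICK\" very";

        // Quote alternative first: the quoted phrase stays a single token.
        System.out.println(tokens("\"[^\"]*\"|\\S+|\\s+", text));

        // \S+ first: the quote is broken into the tokens "NOW and CLICK".
        System.out.println(tokens("\\S+|\\s+|\"[^\"]*\"", text));
    }
}
```

Java's regex alternation tries the alternatives from left to right at each position and takes the first one that matches, which is why the order matters here.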

This regex lets us iterate over the text

FIRST i go to the homepage NOW i click on button "NOW CLICK" very quick THEN i will become a text result.

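as the following tokens (each shown in square brackets):

[FIRST][ ][i][ ][go][ ][to][ ][the][ ][homepage][ ][NOW][ ][i][ ][click][ ][on][ ][button][ ]["NOW CLICK"][ ][very][ ][quick][ ][THEN][ ][i][ ][will][ ][become][ ][a][ ][text][ ][result.]

Note that "NOW CLICK" is treated as a single token. So even though it contains a keyword we want to split on, it will never be equal to that keyword (because it contains other characters, such as the surrounding quotation marks and the other words inside the quote). This prevents it from being treated as a delimiter that should split the text.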

Using this idea, we can write code like this:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
List<String> keywordsToSplitOn = List.of("FIRST", "NOW", "THEN");

// Tokenize into: quotes ".." | words | whitespace
Pattern p = Pattern.compile("\"[^\"]*\"|\\S+|\\s+");
Matcher m = p.matcher(text);

StringBuilder sb = new StringBuilder();
List<String> result = new ArrayList<>();
while (m.find()) {
    String token = m.group();
    // An unquoted keyword starts a new part: store what was collected so far
    if (keywordsToSplitOn.contains(token) && sb.length() != 0) {
        result.add(sb.toString());
        sb.setLength(0); // clear sb
    }
    sb.append(token);
}
if (sb.length() != 0) { // include the rest of the text after the last keyword
    result.add(sb.toString());
}

result.forEach(System.out::println);

Output:

FIRST i go to the homepage 
NOW i click on button "NOW CLICK" very quick 
THEN i will become a text result.

Answer 2

You need to use lookahead and lookbehind (short introduction here).

Just change the regex in the split method to the following:

String[] textArray = text.split("((?=FIRST)|(?=NOW(?! CLICK))|(?=THEN))");

It may be even better to include a space in each expression, to avoid splitting on words such as NOWHERE:

String[] textArray = text.split("((?=FIRST )|(?=NOW (?!CLICK))|(?=THEN ))");
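As a rough check of the second variant, the sketch below wraps it in a small program. The class name LookaheadSplitDemo and the method splitOnKeywords are hypothetical, and the redundant outer parentheses of the regex are dropped. Note that this approach only guards against the specific phrase NOW CLICK; it does not detect quotes in general.

```java
class LookaheadSplitDemo {
    // Splits before each keyword without consuming it (zero-width lookahead).
    // NOW is not a split point when it is followed by "CLICK".
    static String[] splitOnKeywords(String text) {
        return text.split("(?=FIRST )|(?=NOW (?!CLICK))|(?=THEN )");
    }

    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button "
                + "\"NOW CLICK\" very quick THEN i will become a text result.";

        // Brackets make the trailing space of each part visible.
        for (String part : splitOnKeywords(text)) {
            System.out.println("[" + part + "]");
        }

        // NOWHERE is not split, because NOW is not followed by a space.
        for (String part : splitOnKeywords("FIRST go NOWHERE THEN stop")) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Because the lookaheads match without consuming characters, each keyword stays at the start of its part, and the space before the next keyword stays at the end of the previous part.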

Answer 3

You can use Pattern and Matcher to divide the input into groups:

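import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Each lazy group runs from its keyword up to the next keyword;
// this relies on the keywords appearing once each, in this order.
Pattern pattern = Pattern.compile("^(FIRST.*?)(NOW.*?)(THEN.*)$");

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";

Matcher matcher = pattern.matcher(text);

if (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3));
}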

Output:

FIRST i go to the homepage 
NOW i click on button "NOW CLICK" very quick 
THEN i will become a text result.

Answer 4

You can match the following regular expression.

/\bFIRST +(?:(?!\bNOW\b)[^\n])+(?<! )|\bNOW +(?:(?!\bTHEN\b)[^\n])+(?<! )|\bTHEN +.*/

Start your engine!

Java's regex engine performs the following operations.

\bFIRST +      : match 'FIRST' preceded by a word boundary,
                 followed by 1+ spaces
(?:            : begin a non-capture group
  (?!\bNOW\b)  : use a negative lookahead to assert that
                 the following chars are not 'NOW'  
  [^\n]        : match any char other than a newline (\n)
)              : end non-capture group
+              : execute non-capture group 1+ times
(?<! )         : use negative lookbehind to assert that the
                 previous char is not a space
|              : or
\bNOW +        : match 'NOW' preceded by a word boundary,
                 followed by 1+ spaces
(?:            : begin a non-capture group
  (?!\bTHEN\b) : use a negative lookahead to assert that
                 the following chars are not 'THEN'  
  [^\n]        : match any char other than a newline (\n)
)              : end non-capture group
+              : execute non-capture group 1+ times
(?<! )         : use negative lookbehind to assert that the
                 previous char is not a space
|              : or
\bTHEN +.*     : match 'THEN' preceded by a word boundary,
                 followed by 1+ spaces then 0+ chars

This uses a technique called the tempered greedy token solution.
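The regex above is written in /.../ notation, so here is a minimal sketch of how it might be applied in Java by collecting successive matches with Matcher.find(). The class name TemperedTokenDemo, the constant PARTS and the method parts are hypothetical names used only for this illustration. The pattern itself is the one shown above, with backslashes doubled for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedTokenDemo {
    // The regex from this answer, split over three lines for readability.
    private static final Pattern PARTS = Pattern.compile(
            "\\bFIRST +(?:(?!\\bNOW\\b)[^\\n])+(?<! )"
            + "|\\bNOW +(?:(?!\\bTHEN\\b)[^\\n])+(?<! )"
            + "|\\bTHEN +.*");

    // Returns each match as one part, in order of appearance.
    static List<String> parts(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = PARTS.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button "
                + "\"NOW CLICK\" very quick THEN i will become a text result.";
        parts(text).forEach(System.out::println);
    }
}
```

The quoted NOW is not a problem here because, once the NOW branch has started matching, it only stops at THEN. The pattern therefore depends on the keywords appearing in the order FIRST, NOW, THEN rather than on recognizing quotes.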

Answer 5

You can use the following (see here):

public static void main(String args[]) { 
    String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
    String[] textArray = text.split("(?=FIRST)|(?=\\b NOW \\b)|(?=THEN)");
    
    for(String s: textArray) {
        System.out.println(s);
    }
}

Output:

FIRST i go to the homepage
 NOW i click on button "NOW CLICK" very quick 
THEN i will become a text result.
