How can I split the text below using the split criteria FIRST, NOW, THEN?
String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
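Three sentences are expected:
FIRST i go to the homepage
NOW i click on button "NOW CLICK" very quick
THEN i will become a text result.
This code does not work because of the "NOW CLICK" button label: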
String[] textArray = text.split("FIRST|NOW|THEN");
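To make the failure concrete, here is a minimal sketch of what the plain alternation returns. The class name `NaiveSplitDemo` and the helper `naiveSplit` are made up for illustration and are not part of the original code.

```java
import java.util.Arrays;

final class NaiveSplitDemo {
    static final String TEXT = "FIRST i go to the homepage NOW i click on button "
            + "\"NOW CLICK\" very quick THEN i will become a text result.";

    // Splits on every occurrence of a keyword, including the NOW inside the quotes.
    static String[] naiveSplit(String text) {
        return text.split("FIRST|NOW|THEN");
    }

    public static void main(String[] args) {
        // Print each piece in brackets so leading and trailing spaces are visible.
        Arrays.stream(naiveSplit(TEXT)).forEach(s -> System.out.println("[" + s + "]"));
    }
}
```

The keywords are consumed as delimiters, the quoted NOW counts as a split point too, and the match at index 0 leaves an empty first element, so the array has five pieces instead of three.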
Answer 0 (score: 4)
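If I understand you correctly, you want to split the text on FIRST, NOW and THEN while keeping those keywords in the resulting parts. If that guess is right, then instead of the split method you can use find to iterate over all quoted strings, words, and whitespace runs.
This lets you add every quoted string and whitespace run to the result as-is, and focus only on checking the words that are not inside quotes to see whether you should split on them.
A regex representing those parts can look like Pattern.compile("\"[^\"]*\"|\\S+|\\s+");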
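Important: the quoted form "..." must be tried first. Otherwise \\S+ would match "NOW CLICK" as the two separate parts "NOW and CLICK", which would prevent it from being treated as a single quoted string. That is why the "[^"]*" subregex (the one representing quotes) goes at the start of the subregex1|subregex2|subregex3 alternation.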
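The effect of the alternation order can be checked with a small sketch. The class `TokenOrderDemo`, the helper `tokens`, and the shortened sample text are assumptions made for this illustration; only the first pattern comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenOrderDemo {
    // Collects every non-whitespace token produced by the given pattern.
    static List<String> tokens(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            if (!m.group().trim().isEmpty()) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "click on button \"NOW CLICK\" very quick";
        // Quote alternative first: the quoted phrase stays in one piece.
        System.out.println(tokens("\"[^\"]*\"|\\S+|\\s+", text));
        // \S+ first: the quoted phrase is broken at the space.
        System.out.println(tokens("\\S+|\"[^\"]*\"|\\s+", text));
    }
}
```

With the quote alternative first, `"NOW CLICK"` comes out as one token; with `\S+` first it comes out as `"NOW` and `CLICK"`.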
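This quote-first regex lets us iterate over the text
FIRST i go to the homepage NOW i click on button "NOW CLICK" very quick THEN i will become a text result.
as the following tokens (whitespace tokens are omitted from the list):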
FIRST
i
go
to
the
homepage
NOW
i
click
on
button
"NOW CLICK"
very
quick
THEN
i
will
become
a
text
result.
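Note that "NOW CLICK" is treated as a single token. So even though it contains a keyword we want to split on, it will never be equal to that keyword (because it also contains other characters, such as " or the other words inside the quotes). This prevents it from being treated as a delimiter that should split the text.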
Using this idea, we can write code like this:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
List<String> keywordsToSplitOn = List.of("FIRST", "NOW", "THEN");
// let's search for: quotes ".." | words | whitespace
Pattern p = Pattern.compile("\"[^\"]*\"|\\S+|\\s+");
Matcher m = p.matcher(text);
StringBuilder sb = new StringBuilder();
List<String> result = new ArrayList<>();
while (m.find()) {
    String token = m.group();
    // an unquoted keyword closes the part collected so far
    if (keywordsToSplitOn.contains(token) && sb.length() != 0) {
        result.add(sb.toString());
        sb.delete(0, sb.length()); // clear sb
    }
    sb.append(token);
}
if (sb.length() != 0) { // include the rest of the text after the last keyword
    result.add(sb.toString());
}
result.forEach(System.out::println);
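The loop can also be packaged as a reusable helper. This is only a sketch under a few assumptions: the class `QuoteAwareSplitter`, the method `split`, and the sample sentence are hypothetical, and trimming each part is an added choice that the answer's code does not make.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteAwareSplitter {
    // Quoted strings first, then words, then whitespace runs.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|\\S+|\\s+");

    // Starts a new part at each unquoted keyword; parts are trimmed (an added choice).
    static List<String> split(String text, List<String> keywords) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            if (keywords.contains(token) && current.length() > 0) {
                parts.add(current.toString().trim());
                current.setLength(0);
            }
            current.append(token);
        }
        if (current.length() > 0) {
            parts.add(current.toString().trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "FIRST open \"THEN WAIT\" dialog NOW press \"NOW CLICK\" THEN done.";
        split(text, List.of("FIRST", "NOW", "THEN")).forEach(System.out::println);
    }
}
```

Because the keyword check compares whole tokens, the quoted `"THEN WAIT"` and `"NOW CLICK"` never trigger a split, whichever keyword they contain.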
Answer 1 (score: 3)
You need to use lookahead and lookbehind assertions (short introduction here).
Just change the regex in the split method to the following:
String[] textArray = text.split("((?=FIRST)|(?=NOW(?! CLICK))|(?=THEN))");
Even better, include a space in each expression so that the text is not split on, for example, NOWHERE:
String[] textArray = text.split("((?=FIRST )|(?=NOW (?!CLICK))|(?=THEN ))");
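The difference between the two variants shows up on a sentence that contains NOWHERE. The sketch below assumes Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading element. The class `LookaheadSplitDemo`, its method names, and the sample sentence are made up for illustration.

```java
import java.util.Arrays;

final class LookaheadSplitDemo {
    // Lookahead split without a trailing space after each keyword.
    static String[] splitLoose(String text) {
        return text.split("(?=FIRST)|(?=NOW(?! CLICK))|(?=THEN)");
    }

    // Same idea, but each keyword must be followed by a space.
    static String[] splitStrict(String text) {
        return text.split("(?=FIRST )|(?=NOW (?!CLICK))|(?=THEN )");
    }

    public static void main(String[] args) {
        String text = "FIRST i am NOWHERE near it NOW i click \"NOW CLICK\" THEN done";
        System.out.println(Arrays.toString(splitLoose(text)));
        System.out.println(Arrays.toString(splitStrict(text)));
    }
}
```

The loose pattern also splits in front of NOWHERE, while the strict one leaves it alone. Both still depend on the literal word CLICK following the quoted NOW, so they only cover this one button label.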
Answer 2 (score: 1)
You can use Pattern and Matcher to divide the input into groups:
Pattern pattern = Pattern.compile("^(FIRST.*?)(NOW.*?)(THEN.*)$");
String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
Matcher matcher = pattern.matcher(text);
if (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3));
}
Output:
FIRST i go to the homepage
NOW i click on button "NOW CLICK" very quick
THEN i will become a text result.
Answer 3 (score: 1)
You can match the following regular expression.
/\bFIRST +(?:(?!\bNOW\b)[^\n])+(?<! )|\bNOW +(?:(?!\bTHEN\b)[^\n])+(?<! )|\bTHEN +.*/
Java's regex engine performs the following operations.
\bFIRST + : match 'FIRST' preceded by a word boundary,
followed by 1+ spaces
(?: : begin a non-capture group
(?!\bNOW\b) : use a negative lookahead to assert that
the following chars are not 'NOW'
[^\n] : match any char other than a line terminator
) : end non-capture group
+ : execute non-capture group 1+ times
(?<! ) : use negative lookbehind to assert that the
previous char is not a space
| : or
\bNOW + : match 'NOW' preceded by a word boundary,
followed by 1+ spaces
(?: : begin a non-capture group
(?!\bTHEN\b) : use a negative lookahead to assert that
the following chars are not 'THEN'
[^\n] : match any char other than a line terminator
) : end non-capture group
+ : execute non-capture group 1+ times
(?<! ) : use negative lookbehind to assert that the
previous char is not a space
| : or
\bTHEN +.* : match 'THEN' preceded by a word boundary,
followed by 1+ spaces then 0+ chars
This uses a technique called the tempered greedy token solution.
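The answer gives only the regex, so here is a minimal sketch of how it could be applied from Java with Matcher.find. The class `TemperedTokenDemo` and the method `sentences` are hypothetical names; the pattern is the one above with backslashes doubled for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemperedTokenDemo {
    // The regex from this answer, escaped for a Java string literal.
    private static final Pattern SENTENCE = Pattern.compile(
            "\\bFIRST +(?:(?!\\bNOW\\b)[^\\n])+(?<! )"
            + "|\\bNOW +(?:(?!\\bTHEN\\b)[^\\n])+(?<! )"
            + "|\\bTHEN +.*");

    // Returns every match in order of appearance.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button "
                + "\"NOW CLICK\" very quick THEN i will become a text result.";
        sentences(text).forEach(System.out::println);
    }
}
```

The quoted NOW is absorbed because the NOW branch only stops in front of THEN, and the trailing lookbehind makes each of the first two matches give back its final space.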
Answer 4 (score: 0)
You can use the following (see here):
public static void main(String[] args) {
    String text = "FIRST i go to the homepage NOW i click on button \"NOW CLICK\" very quick THEN i will become a text result.";
    String[] textArray = text.split("(?=FIRST)|(?=\\b NOW \\b)|(?=THEN)");
    for (String s : textArray) {
        System.out.println(s);
    }
}
Output:
FIRST i go to the homepage
NOW i click on button "NOW CLICK" very quick
THEN i will become a text result.
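In this pattern the split point for NOW sits in front of the space that precedes it, so the pieces can carry surrounding whitespace. A possible follow-up, sketched with the hypothetical names `TrimmedSplitDemo` and `splitAndTrim`, is to trim each piece after splitting.

```java
import java.util.ArrayList;
import java.util.List;

final class TrimmedSplitDemo {
    // Splits with the lookahead pattern from this answer, then trims each piece.
    static List<String> splitAndTrim(String text) {
        List<String> parts = new ArrayList<>();
        for (String piece : text.split("(?=FIRST)|(?=\\b NOW \\b)|(?=THEN)")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "FIRST i go to the homepage NOW i click on button "
                + "\"NOW CLICK\" very quick THEN i will become a text result.";
        splitAndTrim(text).forEach(System.out::println);
    }
}
```

The quoted NOW is skipped here only because it is preceded by a quote character rather than a space, so this variant relies on that detail of the input.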