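I am trying to split a paragraph into sentences. The paragraph can contain words like F.C.B, and it also includes HTML tags such as anchors and other tags. I tried the expression below, but it does not split the paragraph into the right sentences while leaving the HTML tags as they are.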
String.split("(?<!\\.[a-zA-Z])\\.(?![a-zA-Z]\\.)(?![<[^>]*>])");
Can anyone help me with a better expression, or suggest any other idea?
Answer 0 (score: 1)
You can try this (it needs java.util.regex.Pattern and java.util.regex.Matcher):
String par = "In 2004, Obama received national attention during his campaign to represent Illinois in the United States Senate with his victory in the March Democratic Party primary, his keynote address at the Democratic National Convention in July, and his election to the Senate in November. He began his presidential campaign in 2007 and, after a close primary campaign against Hillary Clinton in 2008, he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.";
Pattern pattern = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher matcher = pattern.matcher(par);
while (matcher.find()) {
    System.out.println(matcher.group());
}
Let me know if it works.
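To make the snippet above easier to try, here is a minimal sketch that wraps the same expression in a small helper. The class name SentenceMatcher and the method sentences are hypothetical names chosen for illustration; the pattern and flags are copied unchanged from the snippet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceMatcher {
    // Same expression and flags as in the snippet above.
    private static final Pattern SENTENCE = Pattern.compile(
        "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
        Pattern.MULTILINE | Pattern.COMMENTS);

    // Hypothetical helper: returns every match of SENTENCE, in order.
    public static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // A period followed by a non-space character (as in F.C.B) does not end a match.
        for (String s : sentences("I like F.C.B a lot. Is it plain text? Yes!")) {
            System.out.println(s);
        }
    }
}
```

Note that this expression does not look at HTML tags at all, so a period followed by whitespace inside an attribute value would presumably still end a match.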
Answer 1 (score: 1)
Rather than splitting on a character, it is easier to match and capture each sentence substring.
(?:<(?:(?:[a-z]+\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?|\/[a-z]+)>)|(?:(?!<)(?:[^.?!]|[.?!](?=\S)))*)+[.?!]
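This regular expression will do the following:
- match each sentence up to its closing ., ? or !
- keep abbreviations such as F.C.B inside the sentence
- step over HTML tags, so that punctuation inside attribute values does not end a sentence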
Note: in Java you need to escape every \ in the expression, so that each one is written as \\ in the string literal.
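As a small illustration of that escaping rule, the sketch below writes the regex \S+ as a Java string literal. The class name EscapeDemo and the constant JAVA_LITERAL are made up for this example.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // The regex \S+ written as a Java string literal: the backslash is doubled.
    public static final String JAVA_LITERAL = "\\S+";

    public static void main(String[] args) {
        // The compiled string holds three characters: backslash, 'S', '+'.
        System.out.println(JAVA_LITERAL.length());
        // F.C.B contains no whitespace, so the whole input matches \S+.
        System.out.println(Pattern.matches(JAVA_LITERAL, "F.C.B"));
    }
}
```

The doubling is only needed in source code. A pattern read from a file or typed into a tool such as regex101 keeps its single backslashes.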
Live demo
https://regex101.com/r/fJ9zS0/3
Sample text
I am was trying to split paragraph to sentences. The paragraph can have a word like F.C.B also it includes some html tag like anchor and other tags. I was trying to use like below but it was not perfect separating my paragraph to the specific sentence by living the html tag as it is.
In 2004, he <a href="http://test.pic.org/jpeg."> received </a> national attention during his Party primary, his keynote address July, <a onmouseover=" fnRotator('I like droids. '); "> and </a> his election to the Senate in November. He began his presidential campaign in he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.
Sample matches
Java Code Example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
    public static void main(String[] asd){
        String sourcestring = " ----your source string goes here----- ";
        Pattern re = Pattern.compile("(?:<(?:(?:[a-z]+\\s(?:[^>=]|='[^']*'|=\"[^\"]*\"|=[^'\"\\s]*)*\"\\s?\\/?|\\/[a-z]+)>)|(?:(?!<)(?:[^.?!]|[.?!](?=\\S)))*)+[.?!]",Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        while (m.find()){
            for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
                System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
            }
            mIdx++;
        }
    }
}
Sample output
$matches Array:
(
[0] => Array
(
[0] => I am was trying to split paragraph to sentences.
[1] => The paragraph can have a word like F.C.B also it includes some html tag like anchor and other tags.
[2] => I was trying to use like below but it was not perfect separating my paragraph to the specific sentence by living the html tag as it is.
[3] =>
In 2004, he <a href="http://test.pic.org/jpeg."> received </a> national attention during his Party primary, his keynote address July, <a onmouseover=" fnRotator('I like droids. '); "> and </a> his election to the Senate in November.
[4] => He began his presidential campaign in he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.
)
)
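In the output above, match [3] starts with a line break, because the whitespace between two sentences becomes part of the following match. If that is not wanted, one option is to trim each match while collecting the results. The sketch below assumes that approach; the class name TrimmedSentences and the method split are hypothetical, and the pattern is copied from the Java example with its flags unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrimmedSentences {
    // Pattern copied from the Java example above.
    private static final Pattern SENTENCE = Pattern.compile(
        "(?:<(?:(?:[a-z]+\\s(?:[^>=]|='[^']*'|=\"[^\"]*\"|=[^'\"\\s]*)*\"\\s?\\/?|\\/[a-z]+)>)|(?:(?!<)(?:[^.?!]|[.?!](?=\\S)))*)+[.?!]",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Hypothetical helper: collects every match with surrounding whitespace removed.
    public static List<String> split(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "He plays for F.C.B now. "
            + "He <a href=\"http://test.pic.org/jpeg.\"> received </a> attention. Done!";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Trimming after the match keeps the expression itself untouched, which seems simpler than adding leading-whitespace handling to an already long pattern.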
NODE                     EXPLANATION
----------------------------------------------------------------------
(?:                      group, but do not capture (1 or more times
                         (matching the most amount possible)):
----------------------------------------------------------------------
  <                        '<'
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      [a-z]+                   any character of: 'a' to 'z' (1 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
      \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
      (?:                      group, but do not capture (0 or more
                               times (matching the most amount
                               possible)):
----------------------------------------------------------------------
        [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
      |                        OR
----------------------------------------------------------------------
        ='                       '=\''
----------------------------------------------------------------------
        [^']*                    any character except: ''' (0 or
                                 more times (matching the most
                                 amount possible))
----------------------------------------------------------------------
        '                        '\''
----------------------------------------------------------------------
      |                        OR
----------------------------------------------------------------------
        ="                       '="'
----------------------------------------------------------------------
        [^"]*                    any character except: '"' (0 or
                                 more times (matching the most
                                 amount possible))
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
      |                        OR
----------------------------------------------------------------------
        =                        '='
----------------------------------------------------------------------
        [^'"\s]*                 any character except: ''', '"',
                                 whitespace (\n, \r, \t, \f, and "
                                 ") (0 or more times (matching the
                                 most amount possible))
----------------------------------------------------------------------
      )*                       end of grouping
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
      \s?                      whitespace (\n, \r, \t, \f, and " ")
                               (optional (matching the most amount
                               possible))
----------------------------------------------------------------------
      \/?                      '/' (optional (matching the most
                               amount possible))
----------------------------------------------------------------------
    |                        OR
----------------------------------------------------------------------
      \/                       '/'
----------------------------------------------------------------------
      [a-z]+                   any character of: 'a' to 'z' (1 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
|                        OR
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more
                           times (matching the most amount
                           possible)):
----------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
      <                        '<'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      [^.?!]                   any character except: '.', '?', '!'
----------------------------------------------------------------------
    |                        OR
----------------------------------------------------------------------
      [.?!]                    any character of: '.', '?', '!'
----------------------------------------------------------------------
      (?=                      look ahead to see if there is:
----------------------------------------------------------------------
        \S                       non-whitespace (all but \n, \r,
                                 \t, \f, and " ")
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
)+                       end of grouping
----------------------------------------------------------------------
[.?!]                    any character of: '.', '?', '!'
----------------------------------------------------------------------
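The breakdown above can also be mirrored in code by assembling the expression from named pieces, which may make it easier to adjust one part at a time. This is only a sketch: the class name SentenceRegexParts and the constant names are invented for illustration, and the assembled string is intended to be identical to the one used in the Java example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceRegexParts {
    // Attribute list: any char except '>' and '=', or ='...', ="...", =unquoted.
    static final String ATTRIBUTES = "(?:[^>=]|='[^']*'|=\"[^\"]*\"|=[^'\"\\s]*)*";
    // Opening tag body: name, whitespace, attributes, then '"', optional space, optional '/'.
    static final String OPEN_TAG = "[a-z]+\\s" + ATTRIBUTES + "\"\\s?\\/?";
    // Closing tag body: '/' followed by the tag name.
    static final String CLOSE_TAG = "\\/[a-z]+";
    // A whole tag: '<', one of the two bodies, '>'.
    static final String TAG = "<(?:(?:" + OPEN_TAG + "|" + CLOSE_TAG + ")>)";
    // Text outside tags: never '<'; '.', '?' or '!' only when followed by non-whitespace.
    static final String TEXT = "(?:(?!<)(?:[^.?!]|[.?!](?=\\S)))*";
    // One sentence: tags and text repeated, closed by '.', '?' or '!'.
    public static final String SENTENCE = "(?:" + TAG + "|" + TEXT + ")+[.?!]";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(SENTENCE, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Matcher m = p.matcher("A <a href=\"x.\"> b </a> c. D.");
        while (m.find()) {
            // Brackets make leading whitespace in a match visible.
            System.out.println("[" + m.group() + "]");
        }
    }
}
```

Reading OPEN_TAG on its own also shows a possible limitation: it requires whitespace after the tag name and a double quote before the closing >, so a tag without attributes, such as <b>, does not appear to be covered by that branch.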