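My text can contain words such as c++, c, .net, and asp.net, in any formatting.
Sample text:
Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.
I already have c, c++, java, .net, and asp.net stored somewhere. I just need to count every occurrence of these words in the text.
The pattern I use for matching is (?i)\\b(" +Pattern.quote(key)+ ")\\b, which does not match things like c++ and .net. So I tried (?i)\\b(" +forRegex(key)+ ")\\b (method link here) to escape the literal, and I got the same result.
The expected output is that it should match (case-insensitively):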
C++: 2
C: 2
java: 2
asp.net: 1
.net: 1
Answer 0 (score: 0)
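import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class KeywordCount {
    public static void main(String[] args) {
        // Add your keywords to this set.
        Set<String> keywords = new LinkedHashSet<>(
                Arrays.asList("c++", "c", "java", "asp.net", ".net"));
        String text = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.";
        // Turn commas and semicolons into spaces, then split on whitespace.
        String[] textArray = text.replaceAll("[,;]", " ").split("\\s+");
        for (String s : keywords) {
            int count = 0;
            for (int i = 0; i < textArray.length; i++) {
                // equalsIgnoreCase so that "Java" also counts for "java".
                if (textArray[i].equalsIgnoreCase(s)) {
                    count++;
                }
            }
            System.out.println(s + " : " + count);
        }
    }
}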
This works most of the time. (If you want better results, change the regex passed to replaceAll.)
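One possible way to "change the regex" is sketched below. It is only an illustration under my own assumptions: the class name TokenCount, the separator set, and the choice to strip trailing '.', '!' and '?' are all hypothetical and not part of the answer above. The point is that a keyword at the end of a sentence ("Java.") still counts, while the leading dot in ".net" is kept.

```java
class TokenCount {
    // Counts tokens equal to the keyword, ignoring case.
    // Assumption: tokens are separated by whitespace, commas, or semicolons,
    // and any trailing sentence punctuation is not part of the keyword.
    static int count(String text, String keyword) {
        int count = 0;
        for (String token : text.split("[\\s,;]+")) {
            // Strip only trailing punctuation, so ".net" keeps its leading dot.
            String cleaned = token.replaceAll("[.!?]+$", "");
            if (cleaned.equalsIgnoreCase(keyword)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "I like Java. Also C++; and .net.";
        System.out.println("java : " + count(text, "java"));
        System.out.println("c++ : " + count(text, "c++"));
        System.out.println(".net : " + count(text, ".net"));
    }
}
```

Whether stripping trailing punctuation is right depends on your keywords; a keyword that itself ends in '.' would no longer match.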
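Answer 1 (score: 0)
I would choose a non-regex solution for your problem. Just put the keywords into an array and search the input string for every occurrence of each one. It uses String.indexOf(String, int) to walk through the string without creating any new objects (apart from the index and the counter).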
public class SearchWordCountNonRegex {
    public static final void main(String[] ignored) {
        // Keywords and input are both lowercase, so the keyword "java"
        // matches "Java", "java", and "JAVA".
        String[] searchWords = {"c++", "c", "java", "asp.net", ".net"};
        String input = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills."
                .toLowerCase();

        for (int i = 0; i < searchWords.length; i++) {
            String searchWord = searchWords[i];
            System.out.print(searchWord + ": ");

            int foundCount = 0;
            int currIdx = 0;
            while (currIdx != -1) {
                currIdx = input.indexOf(searchWord, currIdx);
                if (currIdx != -1) {
                    foundCount++;
                    // Continue searching just past this occurrence.
                    currIdx += searchWord.length();
                }
            }
            System.out.println(foundCount);
        }
    }
}
Output:
c++: 2
c: 4
java: 2
asp.net: 1
.net: 2
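Note that these counts include the "c" inside "c++" and the ".net" inside "asp.net", so they differ from the counts the question expects. A rough sketch of how the same indexOf loop could check the neighbouring characters is below. The class name BoundaryAwareCount, the helper isTokenChar, and the decision to treat '+' as part of a token are my own assumptions for illustration, not something the question specifies.

```java
import java.util.Locale;

class BoundaryAwareCount {
    // True if the character at idx would make the match part of a longer token.
    // Assumption: letters, digits, and '+' belong to a token, so "c" is not
    // counted inside "c++" and ".net" is not counted inside "asp.net".
    static boolean isTokenChar(String s, int idx) {
        if (idx < 0 || idx >= s.length()) {
            return false;
        }
        char ch = s.charAt(idx);
        return Character.isLetterOrDigit(ch) || ch == '+';
    }

    static int count(String input, String word) {
        String text = input.toLowerCase(Locale.ROOT);
        String key = word.toLowerCase(Locale.ROOT);
        int found = 0;
        int idx = text.indexOf(key);
        while (idx != -1) {
            int end = idx + key.length();
            // Count only if the match is not glued to a token character.
            if (!isTokenChar(text, idx - 1) && !isTokenChar(text, end)) {
                found++;
            }
            idx = text.indexOf(key, idx + 1);
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.";
        for (String w : new String[] {"c++", "c", "java", "asp.net", ".net"}) {
            System.out.println(w + ": " + count(input, w));
        }
        // Prints c++: 2, c: 2, java: 2, asp.net: 1, .net: 1 (one per line).
    }
}
```

With this rule a "c" directly followed by another symbol such as '#' would still be counted, so the set of token characters may need adjusting for your keyword list.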
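If you really want a regex solution, you could try something like the following, which uses a case-insensitive pattern to match each keyword.
The catch is that the occurrence counts have to be tracked separately. This could be done, for example, by adding each keyword found to a map, where the key is the keyword and the value is its current count. Also, once a match is found, the search continues from the end of that match, which means any overlapping matches are hidden (for example, when Asp.NET is found, that particular .NET match will never be found). This may or may not be the desired behavior.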
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class SearchWordsRegexNoCounts {
    public static final void main(String[] ignored) {
        // Longer alternatives come first so that "C++" wins over "C"
        // and "Asp.NET" wins over ".NET".
        Matcher keywordMtchr = Pattern.compile("(C\\+\\+|C|Java|Asp\\.NET|\\.NET)",
                Pattern.CASE_INSENSITIVE).matcher("");
        String input = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.";

        keywordMtchr.reset(input);
        while (keywordMtchr.find()) {
            System.out.println("Keyword found at index " + keywordMtchr.start() + ": " + keywordMtchr.group(1));
        }
    }
}
Output:
Keyword found at index 7: java
Keyword found at index 32: .net
Keyword found at index 57: C
Keyword found at index 60: C++
Keyword found at index 90: C
Keyword found at index 92: C++
Keyword found at index 96: Java
Keyword found at index 101: asp.net
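The map-based counting mentioned above could look roughly like this. It is a sketch only: the class name SearchWordsRegexWithCounts and the choice to lowercase each match as the map key are my assumptions. Like the pattern above, it has no boundary checks, so a "c" inside an ordinary word would also be counted.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SearchWordsRegexWithCounts {
    private static final Pattern KEYWORDS = Pattern.compile(
            "(C\\+\\+|C|Java|Asp\\.NET|\\.NET)", Pattern.CASE_INSENSITIVE);

    // Returns keyword -> count, keyed by the lowercased match so that
    // "Java" and "java" share one entry. Insertion order is kept.
    static Map<String, Integer> count(String input) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = KEYWORDS.matcher(input);
        while (m.find()) {
            counts.merge(m.group(1).toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String input = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.";
        System.out.println(count(input));
        // Prints: {java=2, .net=1, c=2, c++=2, asp.net=1}
    }
}
```

Map.merge inserts 1 for a keyword seen for the first time and otherwise adds 1 to the stored count, which avoids the explicit null check.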
Answer 2 (score: 0)
Using regex, I came up with the following solution. It may find unwanted matches, though, as noted in the code comments:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SearchWordsRegexCounts {
    // "\\" is first because we don't want to escape any escape characters we will
    // be adding ourselves
    private static final String[] regexSpecial = {"\\", "(", ")", "[", "]", "{",
            "}", ".", "+", "*", "?", "^", "$", "|"};
    private static final String regexEscape = "\\";
    private static final String[] regexEscapedSpecial;

    static {
        regexEscapedSpecial = new String[regexSpecial.length];
        for (int i = 0; i < regexSpecial.length; i++) {
            regexEscapedSpecial[i] = regexEscape + regexSpecial[i];
        }
    }

    public static void main(String[] args) {
        Set<String> searchWords = new HashSet<String>(Arrays.asList("c++", "c",
                ".net", "asp.net", "java"));
        String text = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me\nC,C++,Java,asp.net skills.";
        System.out.println(numOccurrences(text, searchWords, false));
    }

    /**
     * Counts the number of occurrences of the given words in the given text. This
     * allows the given "words" to contain non-word characters. Note that
     * unexpected matches can occur. For example, if one of the words to match is
     * "c", none of the "c"s in "coconut" will be matched, but the "c" in
     * "c-section" will be, even if only "c" as in "the C programming language"
     * was intended.
     */
    public static Map<String, Integer> numOccurrences(String text,
            Set<String> searchWords, boolean caseSensitive) {
        Map<String, String> lowerCaseToSearchWords = new HashMap<String, String>();
        List<String> searchWordsInOrder = sortByNonInclusion(searchWords);

        // A match must not be preceded or followed by a word character.
        StringBuilder regex = new StringBuilder("(?<!\\w)(");
        boolean started = false;
        for (String searchWord : searchWordsInOrder) {
            lowerCaseToSearchWords.put(searchWord.toLowerCase(), searchWord);
            if (started) {
                regex.append("|");
            } else {
                started = true;
            }
            regex.append(escapeRegex(searchWord));
        }
        regex.append(")(?!\\w)");

        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        Pattern pattern = Pattern.compile(regex.toString(), flags);
        Matcher matcher = pattern.matcher(text);

        Map<String, Integer> matches = new HashMap<String, Integer>();
        while (matcher.find()) {
            // Map the matched text back to the search word as it was supplied.
            String match = lowerCaseToSearchWords.get(matcher.group(1).toLowerCase());
            Integer oldVal = matches.get(match);
            if (oldVal == null) {
                oldVal = 0;
            }
            matches.put(match, oldVal + 1);
        }
        return matches;
    }

    /**
     * Sorts the given collection of words in such a way that if A is a prefix of
     * B, then it is guaranteed that A will appear after B in the sorted list.
     */
    public static List<String> sortByNonInclusion(Collection<String> toSort) {
        List<String> sorted = new ArrayList<String>(new HashSet<String>(toSort));
        // sorting in reverse alphabetical order will ensure that if A is a prefix
        // of B it will appear later in the list than B
        Collections.sort(sorted, new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                return o2.compareTo(o1);
            }
        });
        return sorted;
    }

    /**
     * Escape all regex special characters in the given text.
     */
    public static String escapeRegex(String toEscape) {
        for (int i = 0; i < regexSpecial.length; i++) {
            toEscape = toEscape.replace(regexSpecial[i], regexEscapedSpecial[i]);
        }
        return toEscape;
    }
}
This prints:
{asp.net=1, c=2, c++=2, java=2, .net=1}
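For what it's worth, here is a small demo of why the \b pattern from the question misses c++ and .net, while lookarounds of the form (?<!\w) and (?!\w) do not. \b only matches between a word character and a non-word character, and '+' and '.' are both non-word characters. The class name WordBoundaryDemo and the short sample string are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Counts case-insensitive matches of the regex in the text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "C, C++ and .net";
        String cpp = Pattern.quote("c++");
        String dotNet = Pattern.quote(".net");

        // No boundary exists between '+' and the following space: 0 matches.
        System.out.println(count("\\b" + cpp + "\\b", text));
        // No boundary exists between the space and '.': 0 matches.
        System.out.println(count("\\b" + dotNet + "\\b", text));
        // The lookarounds only require "no word character next to the match".
        System.out.println(count("(?<!\\w)" + cpp + "(?!\\w)", text));    // 1
        System.out.println(count("(?<!\\w)" + dotNet + "(?!\\w)", text)); // 1
    }
}
```

The lookarounds alone would still let a bare "c" match the "C" in "C++", which is presumably why the code above tries longer words first in the alternation.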