Regex pattern for matching words like c++ in text

Time: 2014-07-16 13:18:59

Tags: java regex

My text can contain words like c++, c, .net, asp.net and so on, in any format.

Sample text:

  

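Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.

I already have c, c++, java, .net, asp.net stored somewhere. I just need to pick out every occurrence of these words in the text.

The pattern I am using to match is (?i)\\b(" +Pattern.quote(key)+ ")\\b, which does not match things like c++ and .net. So I tried escaping the literals with (?i)\\b(" +forRegex(key)+ ")\\b (method link here), and I got the same result.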

The expected output is that it should match (case-insensitively):

C++:2

C:2

java:2

asp.net:1

.net:1

3 Answers:

Answer 0 (score: 0)

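import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class KeywordSplitCount {
  public static void main(String[] args) {
    // Add your keywords to this set; the question's keywords are used here.
    Set<String> keywords = new LinkedHashSet<String>(Arrays.asList("c++", "c", "java", ".net", "asp.net"));
    String text = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.";
    // Turn commas and semicolons into spaces, then split on runs of whitespace.
    text = text.replaceAll("[,;]", " ");
    String[] textArray = text.split("\\s+");
    for (String s : keywords) {
      int count = 0;
      for (int i = 0; i < textArray.length; i++) {
        // Compare case-insensitively, as the question requires.
        if (textArray[i].equalsIgnoreCase(s)) {
          count++;
        }
      }
      System.out.println(s + " : " + count);
    }
  }
}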

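This works most of the time. (If you want better results, adjust the regex in the replaceAll call.)

Answer 1 (score: 0)

I would choose a non-regex solution for your problem. Just put the keywords in an array and search the input string for every occurrence of each one. It uses String.indexOf(String, int) to walk the string without creating any new objects (apart from the index and the counter).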

public class SearchWordCountNonRegex  {
   public static final void main(String[] ignored)  {

      //Keywords and input searched for with lowercase, so the keyword "java"
      //matches "Java", "java", and "JAVA".

      String[] searchWords = {"c++", "c", "java", "asp.net", ".net"};
      String input = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.".
         toLowerCase();

      for(int i = 0; i < searchWords.length; i++)  {
         String searchWord = searchWords[i];

         System.out.print(searchWord + ": ");

         int foundCount = 0;
         int currIdx = 0;
         while(currIdx != -1)  {
            currIdx = input.indexOf(searchWord, currIdx);

            if(currIdx != -1)  {
               foundCount++;
               currIdx += searchWord.length();
            }  else  {
               currIdx = -1;
            }
         }

         System.out.println(foundCount);

      }
   }
}

Output:

c++: 2
c: 4
java: 2
asp.net: 1
.net: 2
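
The counts for c and .net above are higher than the question expects because indexOf also finds the c inside c++ and the .net inside asp.net. One way to keep the non-regex approach and avoid that is to search the longest keywords first and blank out each match so that shorter keywords cannot reuse it. The following is only a sketch under assumptions: the class name SearchWordCountNoOverlap and the count method are hypothetical, the keywords are assumed to be lowercase, Java 8 or later is assumed for the lambda, and there is still no word-boundary check, so c would also be counted inside a word such as coconut.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class SearchWordCountNoOverlap {

   // Hypothetical helper: counts lowercase keywords without letting a shorter
   // keyword match inside an occurrence of a longer one.
   public static Map<String, Integer> count(String input, String[] searchWords) {
      String[] words = searchWords.clone();
      // Longest first, so "asp.net" is consumed before ".net" and "c++" before "c".
      Arrays.sort(words, (a, b) -> Integer.compare(b.length(), a.length()));

      StringBuilder remaining = new StringBuilder(input.toLowerCase(Locale.ROOT));
      Map<String, Integer> counts = new LinkedHashMap<>();
      for (String word : words) {
         if (word.isEmpty()) {
            continue; // an empty keyword would match everywhere
         }
         int found = 0;
         int idx = remaining.indexOf(word);
         while (idx != -1) {
            found++;
            // Overwrite the match with spaces so shorter keywords cannot reuse it.
            for (int i = idx; i < idx + word.length(); i++) {
               remaining.setCharAt(i, ' ');
            }
            idx = remaining.indexOf(word, idx + word.length());
         }
         counts.put(word, found);
      }
      return counts;
   }

   public static void main(String[] args) {
      String input = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.";
      String[] searchWords = {"c++", "c", "java", "asp.net", ".net"};
      // prints {asp.net=1, java=2, .net=1, c++=2, c=2}
      System.out.println(count(input, searchWords));
   }
}
```

Unlike the version above, this does create one mutable copy of the input, which is the price of hiding the overlapping matches.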

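If you really want a regex solution, you could try the following, which matches each keyword using a case-insensitive pattern.

The catch is that the number of occurrences must be tracked separately. This could be done, for example, by adding each found keyword to a map, where the key is the keyword and the value is its current count. Also, once a match is found, the search continues from that point, which means that any possible overlapping matches are hidden (for example, when Asp.NET is found, the separate .NET match inside it will never be found). This may or may not be the desired behavior.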

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SearchWordsRegexNoCounts  {
   public static final void main(String[] ignored)  {

      Matcher keywordMtchr = Pattern.compile("(C\\+\\+|C|Java|Asp\\.NET|\\.NET)",
         Pattern.CASE_INSENSITIVE).matcher("");

      String input = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.";

      keywordMtchr.reset(input);
      while(keywordMtchr.find())  {
         System.out.println("Keyword found at index " + keywordMtchr.start() + ": " + keywordMtchr.group(1));
      }
   }
}

Output:

Keyword found at index 7: java
Keyword found at index 32: .net
Keyword found at index 57: C
Keyword found at index 60: C++
Keyword found at index 90: C
Keyword found at index 92: C++
Keyword found at index 96: Java
Keyword found at index 101: asp.net
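
The map-based counting mentioned above could look roughly like this. It is a minimal sketch, not part of the original code: the class name SearchWordsRegexWithCounts and the method countKeywords are made up for illustration, keys are lowercased so that Java and java share one counter, and Map.merge assumes Java 8 or later.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SearchWordsRegexWithCounts {

   // Hypothetical helper: tallies every keyword match, keyed by its lowercased text.
   public static Map<String, Integer> countKeywords(String input) {
      Matcher keywordMtchr = Pattern.compile("(C\\+\\+|C|Java|Asp\\.NET|\\.NET)",
         Pattern.CASE_INSENSITIVE).matcher(input);

      // LinkedHashMap keeps the keywords in the order they were first found.
      Map<String, Integer> counts = new LinkedHashMap<>();
      while (keywordMtchr.find()) {
         String keyword = keywordMtchr.group(1).toLowerCase(Locale.ROOT);
         // Start at 1 for a new keyword, otherwise add 1 to the current count.
         counts.merge(keyword, 1, Integer::sum);
      }
      return counts;
   }

   public static void main(String[] args) {
      String input = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.";
      // prints {java=2, .net=1, c=2, c++=2, asp.net=1}
      System.out.println(countKeywords(input));
   }
}
```

Because the pattern has no boundary checks, a C inside an ordinary word would be counted as well; the sample text simply does not contain one.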

Answer 2 (score: 0)

Using regex, I came up with the following solution. It may find unwanted matches, though, as explained in the code comments:

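// Imports needed by the members below, which belong inside a class body.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// "\\" is first because we don't want to escape any escape characters we will
// be adding ourselves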
private static final String[] regexSpecial = {"\\", "(", ")", "[", "]", "{",
    "}", ".", "+", "*", "?", "^", "$", "|"};

private static final String regexEscape = "\\";

private static final String[] regexEscapedSpecial;

static {
  regexEscapedSpecial = new String[regexSpecial.length];
  for (int i = 0; i < regexSpecial.length; i++) {
    regexEscapedSpecial[i] = regexEscape + regexSpecial[i];
  }
}

public static void main(String[] args) throws Throwable {
  Set<String> searchWords = new HashSet<String>(Arrays.asList("c++", "c",
      ".net", "asp.net", "java"));
  String text = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me\nC,C++,Java,asp.net skills.";

  System.out.println(numOccurrences(text, searchWords, false));
}

/**
 * Counts the number of occurrences of the given words in the given text. This
 * allows the given "words" to contain non-word characters. Note that
 * unexpected matches are possible. For example, if one of the words to match
 * is "c", then none of the "c"s in "coconut" will be matched, but the "c" in
 * "c-section" will be, even if only "c" as in "the C programming language"
 * was intended.
 */
public static Map<String, Integer> numOccurrences(String text,
    Set<String> searchWords, boolean caseSensitive) {
  Map<String, String> lowerCaseToSearchWords = new HashMap<String, String>();
  List<String> searchWordsInOrder = sortByNonInclusion(searchWords);

  StringBuilder regex = new StringBuilder("(?<!\\w)(");
  boolean started = false;
  for (String searchWord : searchWordsInOrder) {
    lowerCaseToSearchWords.put(searchWord.toLowerCase(), searchWord);

    if (started) {
      regex.append("|");
    } else {
      started = true;
    }
    regex.append(escapeRegex(searchWord));
  }
  regex.append(")(?!\\w)");

  Pattern pattern = null;
  if (caseSensitive) {
    pattern = Pattern.compile(regex.toString());
  } else {
    pattern = Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
  }
  Matcher matcher = pattern.matcher(text);

  Map<String, Integer> matches = new HashMap<String, Integer>();
  while (matcher.find()) {
    String match = lowerCaseToSearchWords.get(matcher.group(1).toLowerCase());
    Integer oldVal = matches.get(match);
    if (oldVal == null) {
      oldVal = 0;
    }
    matches.put(match, oldVal + 1);
  }

  return matches;
}

/**
 * Sorts the given collection of words in such a way that if A is a prefix of
 * B, then it is guaranteed that A will appear after B in the sorted list.
 */
public static List<String> sortByNonInclusion(Collection<String> toSort) {
  List<String> sorted = new ArrayList<String>(new HashSet<String>(toSort));
  // sorting in reverse alphabetical order will ensure that if A is a prefix
  // of B it will appear later in the list than B
  Collections.sort(sorted, new Comparator<String>() {

    @Override
    public int compare(String o1, String o2) {
      return o2.compareTo(o1);
    }
  });
  return sorted;
}

/**
 * Escape all regex special characters in the given text.
 */
public static String escapeRegex(String toEscape) {
  for (int i = 0; i < regexSpecial.length; i++) {
    toEscape = toEscape.replace(regexSpecial[i], regexEscapedSpecial[i]);
  }
  return toEscape;
}

This prints:

{asp.net=1, c=2, c++=2, java=2, .net=1}
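
The reason the question's \b version fails while the lookarounds here work is that \b only matches between a word character and a non-word character. It therefore cannot match after the final + of c++ when a space or comma follows, nor before the leading . of .net when a space precedes it. (?<!\w) and (?!\w) only require that no word character be adjacent. With that boundary in place, the hand-written escaping can also be delegated to Pattern.quote. The sketch below shows this more compact variant under assumptions: the class name KeywordCounterLookaround and its count method are hypothetical, keywords are ordered longest first instead of reverse alphabetically, results are keyed by the lowercased match rather than the original keyword spelling, and Java 8 or later is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordCounterLookaround {

  // Hypothetical helper: counts case-insensitive keyword matches that are not
  // directly preceded or followed by a word character.
  public static Map<String, Integer> count(String text, List<String> keywords) {
    List<String> ordered = new ArrayList<>(keywords);
    // Longest first, so "c++" is tried before its prefix "c".
    ordered.sort((a, b) -> Integer.compare(b.length(), a.length()));

    StringBuilder alternation = new StringBuilder();
    for (String keyword : ordered) {
      if (alternation.length() > 0) {
        alternation.append('|');
      }
      // Pattern.quote turns the keyword into a literal, so "." and "+" are safe.
      alternation.append(Pattern.quote(keyword));
    }

    Pattern pattern = Pattern.compile("(?<!\\w)(" + alternation + ")(?!\\w)",
        Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(text);

    // TreeMap keeps the result in a stable, sorted order.
    Map<String, Integer> counts = new TreeMap<>();
    while (matcher.find()) {
      counts.merge(matcher.group(1).toLowerCase(Locale.ROOT), 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    String text = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.";
    // prints {.net=1, asp.net=1, c=2, c++=2, java=2}
    System.out.println(count(text, Arrays.asList("c++", "c", "java", ".net", "asp.net")));
  }
}
```

The same caveat as in the Javadoc above applies: the c in c-section would still be counted.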