From Java

Time: 2015-11-11 07:14:28

Tags: java string algorithm performance arraylist

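I have searched a lot for this, and most posts talk either about finding the common strings between two ArrayLists, which can be done with retainAll (from the Collection interface), or about an ArrayList of single words that is compared against a text.

My text might look like this in Java.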

String text = "Get a placement right today by applying to our interviews and don't forget to email us your resume. This is a top job opportunity to get yourself acquainted with real world programming and skill building. Hurry! apply for placement now here";

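I have an ArrayList with, let's say, 2 strings: "placement" and "job opportunity".

I want the result to be placement (2) and job opportunity (1). I currently have a few approaches in mind, but I would like to know the best way to achieve this.

Approach 1: Maintain a counter for each word in the ArrayList. For each word in the ArrayList, call text.contains(word) and, if it returns true, increment the corresponding counter. What happens if the text has more words than the ArrayList, or the ArrayList has more words than the text? Is there a more optimal or shorter way to achieve the same thing? My ArrayList may contain single words or phrases. Thanks in advance for your suggestions.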

2 Answers:

Answer 0: (score: 3)

A simple way is to use String.indexOf to search for each word in the list:

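for (String word : list) {
  int prev = -1;   // index of the previous match; -1 means none yet
  int count = 0;
  do {
    // s is the text being searched; resume one character past the previous match
    prev = s.indexOf(word, prev + 1);
    if (prev != -1 /* && check for word breaks */) {
      count++;
    }
  } while (prev != -1);
  System.out.println(word + " " + count);
}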

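However, this is not designed to meet any particular criterion other than simplicity.

Note that this does not check for word breaks, so it would find "foo" in "xfoox"; the condition can be changed at the place I indicated to check for them.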
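The word-break check hinted at in the loop's comment could look something like the following. This is a minimal sketch rather than the answerer's code: the class name PhraseCounter, the helper isBoundary, and the decision to treat any character that is not a letter or digit as a word break are all assumptions made for illustration. Matching is case-sensitive.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PhraseCounter {

    // True when index i lies outside the text or holds a non-letter, non-digit character.
    // (Assumed definition of a "word break"; adjust to taste.)
    private static boolean isBoundary(String text, int i) {
        return i < 0 || i >= text.length() || !Character.isLetterOrDigit(text.charAt(i));
    }

    // Counts occurrences of each phrase that are delimited by word breaks on both sides.
    static Map<String, Integer> countAll(String text, List<String> phrases) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String phrase : phrases) {
            int count = 0;
            if (!phrase.isEmpty()) {               // an empty phrase would match everywhere
                int at = text.indexOf(phrase);
                while (at != -1) {
                    if (isBoundary(text, at - 1) && isBoundary(text, at + phrase.length())) {
                        count++;
                    }
                    at = text.indexOf(phrase, at + 1);
                }
            }
            counts.put(phrase, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "Get a placement right today by applying to our interviews and don't forget "
                + "to email us your resume. This is a top job opportunity to get yourself acquainted "
                + "with real world programming and skill building. Hurry! apply for placement now here";
        Map<String, Integer> counts = countAll(text, Arrays.asList("placement", "job opportunity"));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
        // prints "placement 2" and then "job opportunity 1"
    }
}
```

With this check in place, "foo" is no longer counted inside "xfoox" or "food", while phrases containing spaces still work because only the characters just outside the match are inspected.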

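If you need to handle a very large list of words, an algorithm like Aho-Corasick would be more efficient, because it avoids scanning the text separately for every string in the list. However, it requires some preprocessing of the word list, although this can be implemented reasonably efficiently and can be done offline for a given word list.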
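To make the Aho-Corasick idea concrete, here is a compact sketch. It is illustrative only: the class name AhoCorasickCounter and its methods are made up for this example, transitions are stored in hash maps for brevity rather than speed, and it counts plain substring matches (overlaps included) without any word-break check.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class AhoCorasickCounter {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                       // longest proper suffix that is also a trie path
        final List<Integer> output = new ArrayList<>();  // indices of patterns that end at this state
    }

    private final Node root = new Node();
    private final List<String> patterns;

    // Preprocessing: build a trie of the patterns, then add failure links breadth-first.
    AhoCorasickCounter(List<String> patterns) {
        this.patterns = new ArrayList<>(patterns);
        for (int p = 0; p < this.patterns.size(); p++) {
            Node node = root;
            for (char c : this.patterns.get(p).toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            if (node != root) node.output.add(p);        // empty patterns are ignored
        }
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) f = f.fail;
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                child.output.addAll(child.fail.output);  // inherit matches that end at the suffix state
                queue.add(child);
            }
        }
    }

    // Single pass over the text; counts[i] is the number of matches of patterns.get(i).
    int[] count(String text) {
        int[] counts = new int[patterns.size()];
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            for (int p : node.output) counts[p]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("placement", "job opportunity");
        String text = "Get a placement right today. This is a top job opportunity. Apply for placement now";
        int[] counts = new AhoCorasickCounter(words).count(text);
        for (int i = 0; i < words.size(); i++) {
            System.out.println(words.get(i) + " " + counts[i]);
        }
        // prints "placement 2" and then "job opportunity 1"
    }
}
```

The constructor is the preprocessing step mentioned above; once built, the same automaton can be reused for any number of texts, and each text is read only once regardless of how many words are in the list.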

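Answer 1: (score: 2)

If I understand correctly, this problem is an instance of the pattern matching problem. This Wikipedia page lists the best string searching algorithms along with their average and worst-case complexities. If I remember correctly, The Design and Analysis of Computer Algorithms by Alfred V. Aho, Jeffrey Ullman and John E. Hopcroft analyzes finite-state automaton based search in its pattern matching chapter.

The following two seem to be the most efficient.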

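  1. KMP matching algorithm
  2. Boyer Moore string search algorithm

I found implementations of both algorithms at http://algs4.cs.princeton.edu. In case the links break, I am also copying the files here. Implementations:

  1. KMP implementation (time complexity Θ(m) + Θ(n))
  2. Boyer Moore implementation (time complexity Θ(m + k) + O(n))

Here m is the pattern length, n is the text length, and k is the alphabet size. Note that the Boyer Moore copy below implements only the bad-character rule, so its worst case is closer to m·n. StdOut is just System.out.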

      Backup of KMP:

      /******************************************************************************
       *  Compilation:  javac KMP.java
       *  Execution:    java KMP pattern text
       *  Dependencies: StdOut.java
       *
       *  Reads in two strings, the pattern and the input text, and
       *  searches for the pattern in the input text using the
       *  KMP algorithm.
       *
       *  % java KMP abracadabra abacadabrabracabracadabrabrabracad
       *  text:    abacadabrabracabracadabrabrabracad 
       *  pattern:               abracadabra          
       *
       *  % java KMP rab abacadabrabracabracadabrabrabracad
       *  text:    abacadabrabracabracadabrabrabracad 
       *  pattern:         rab
       *
       *  % java KMP bcara abacadabrabracabracadabrabrabracad
       *  text:    abacadabrabracabracadabrabrabracad 
       *  pattern:                                   bcara
       *
       *  % java KMP rabrabracad abacadabrabracabracadabrabrabracad 
       *  text:    abacadabrabracabracadabrabrabracad
       *  pattern:                        rabrabracad
       *
       *  % java KMP abacad abacadabrabracabracadabrabrabracad
       *  text:    abacadabrabracabracadabrabrabracad
       *  pattern: abacad
       *
       ******************************************************************************/
      
      /**
       *  The <tt>KMP</tt> class finds the first occurrence of a pattern string
       *  in a text string.
       *  <p>
       *  This implementation uses a version of the Knuth-Morris-Pratt substring search
       *  algorithm. The version takes time and space proportional to
       *  <em>N</em> + <em>M R</em> in the worst case, where <em>N</em> is the length
       *  of the text string, <em>M</em> is the length of the pattern, and <em>R</em>
       *  is the alphabet size.
       *  <p>
       *  For additional documentation,
       *  see <a href="http://algs4.cs.princeton.edu/53substring">Section 5.3</a> of
       *  <i>Algorithms, 4th Edition</i> by Robert Sedgewick and Kevin Wayne.
       */
      public class KMP {
          private final int R;       // the radix
          private int[][] dfa;       // the KMP automaton
      
          private char[] pattern;    // either the character array for the pattern
          private String pat;        // or the pattern string
      
          /**
           * Preprocesses the pattern string.
           *
           * @param pat the pattern string
           */
          public KMP(String pat) {
              this.R = 256;
              this.pat = pat;
      
              // build DFA from pattern
              int M = pat.length();
              dfa = new int[R][M]; 
              dfa[pat.charAt(0)][0] = 1; 
              for (int X = 0, j = 1; j < M; j++) {
                  for (int c = 0; c < R; c++) 
                      dfa[c][j] = dfa[c][X];     // Copy mismatch cases. 
                  dfa[pat.charAt(j)][j] = j+1;   // Set match case. 
                  X = dfa[pat.charAt(j)][X];     // Update restart state. 
              } 
          } 
      
          /**
           * Preprocesses the pattern string.
           *
           * @param pattern the pattern string
           * @param R the alphabet size
           */
          public KMP(char[] pattern, int R) {
              this.R = R;
              this.pattern = new char[pattern.length];
              for (int j = 0; j < pattern.length; j++)
                  this.pattern[j] = pattern[j];
      
              // build DFA from pattern
              int M = pattern.length;
              dfa = new int[R][M]; 
              dfa[pattern[0]][0] = 1; 
              for (int X = 0, j = 1; j < M; j++) {
                  for (int c = 0; c < R; c++) 
                      dfa[c][j] = dfa[c][X];     // Copy mismatch cases. 
                  dfa[pattern[j]][j] = j+1;      // Set match case. 
                  X = dfa[pattern[j]][X];        // Update restart state. 
              } 
          } 
      
          /**
           * Returns the index of the first occurrence of the pattern string
           * in the text string.
           *
           * @param  txt the text string
           * @return the index of the first occurrence of the pattern string
           *         in the text string; N if no such match
           */
          public int search(String txt) {
      
              // simulate operation of DFA on text
              int M = pat.length();
              int N = txt.length();
              int i, j;
              for (i = 0, j = 0; i < N && j < M; i++) {
                  j = dfa[txt.charAt(i)][j];
              }
              if (j == M) return i - M;    // found
              return N;                    // not found
          }
      
          /**
           * Returns the index of the first occurrence of the pattern string
           * in the text string.
           *
           * @param  text the text string
           * @return the index of the first occurrence of the pattern string
           *         in the text string; N if no such match
           */
          public int search(char[] text) {
      
              // simulate operation of DFA on text
              int M = pattern.length;
              int N = text.length;
              int i, j;
              for (i = 0, j = 0; i < N && j < M; i++) {
                  j = dfa[text[i]][j];
              }
              if (j == M) return i - M;    // found
              return N;                    // not found
          }
      
      
          /** 
           * Takes a pattern string and an input string as command-line arguments;
           * searches for the pattern string in the text string; and prints
           * the first occurrence of the pattern string in the text string.
           */
          public static void main(String[] args) {
              String pat = args[0];
              String txt = args[1];
              char[] pattern = pat.toCharArray();
              char[] text    = txt.toCharArray();
      
              KMP kmp1 = new KMP(pat);
              int offset1 = kmp1.search(txt);
      
              KMP kmp2 = new KMP(pattern, 256);
              int offset2 = kmp2.search(text);
      
              // print results
              StdOut.println("text:    " + txt);
      
              StdOut.print("pattern: ");
              for (int i = 0; i < offset1; i++)
                  StdOut.print(" ");
              StdOut.println(pat);
      
              StdOut.print("pattern: ");
              for (int i = 0; i < offset2; i++)
                  StdOut.print(" ");
              StdOut.println(pat);
          }
      }
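The KMP class above reports only the first match, whereas the question asks for counts. One way to get counts is sketched below, using the failure-table variant of Knuth-Morris-Pratt instead of the DFA table, so it is a different formulation from the Sedgewick and Wayne code. The class name KmpCounter and both method names are hypothetical. Overlapping matches are counted, and no word-break check is applied.

```java
class KmpCounter {

    // fail[j] = length of the longest proper prefix of pat[0..j] that is also a suffix of it.
    static int[] failureTable(String pat) {
        int[] fail = new int[pat.length()];
        for (int j = 1, k = 0; j < pat.length(); j++) {
            while (k > 0 && pat.charAt(j) != pat.charAt(k)) k = fail[k - 1];
            if (pat.charAt(j) == pat.charAt(k)) k++;
            fail[j] = k;
        }
        return fail;
    }

    // Counts every occurrence of pat in text in one left-to-right pass.
    static int countOccurrences(String text, String pat) {
        if (pat.isEmpty()) return 0;              // avoid counting the empty pattern
        int[] fail = failureTable(pat);
        int count = 0;
        for (int i = 0, k = 0; i < text.length(); i++) {
            // k is the number of pattern characters currently matched
            while (k > 0 && text.charAt(i) != pat.charAt(k)) k = fail[k - 1];
            if (text.charAt(i) == pat.charAt(k)) k++;
            if (k == pat.length()) {              // full match ending at index i
                count++;
                k = fail[k - 1];                  // keep going so overlaps are found too
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Get a placement right today. Hurry! apply for placement now here";
        System.out.println("placement " + countOccurrences(text, "placement"));  // prints "placement 2"
    }
}
```

Because the table is indexed by pattern position rather than by character, this variant needs only space proportional to the pattern length and works for any char value, not just those below a radix of 256.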
      

      Backup of Boyer Moore:

      /******************************************************************************
       *  Compilation:  javac BoyerMoore.java
       *  Execution:    java BoyerMoore pattern text
       *  Dependencies: StdOut.java
       *
       *  Reads in two strings, the pattern and the input text, and
       *  searches for the pattern in the input text using the
       *  bad-character rule part of the Boyer-Moore algorithm.
       *  (does not implement the strong good suffix rule)
       *
       *  % java BoyerMoore abracadabra abacadabrabracabracadabrabrabracad
       *  text:    abacadabrabracabracadabrabrabracad 
       *  pattern:               abracadabra
       *
       *  % java BoyerMoore rab abacadabrabracabracadabrabrabracad
       *  text:    abacadabrabracabracadabrabrabracad 
       *  pattern:         rab
       *
       *  % java BoyerMoore bcara abacadabrabracabracadabrabrabracad
       *  text:    abacadabrabracabracadabrabrabracad 
       *  pattern:                                   bcara
       *
       *  % java BoyerMoore rabrabracad abacadabrabracabracadabrabrabracad
       *  text:    abacadabrabracabracadabrabrabracad
       *  pattern:                        rabrabracad
       *
       *  % java BoyerMoore abacad abacadabrabracabracadabrabrabracad
       *  text:    abacadabrabracabracadabrabrabracad
       *  pattern: abacad
       *
       ******************************************************************************/
      
      /**
       *  The <tt>BoyerMoore</tt> class finds the first occurrence of a pattern string
       *  in a text string.
       *  <p>
       *  This implementation uses the Boyer-Moore algorithm (with the bad-character
       *  rule, but not the strong good suffix rule).
       *  <p>
       *  For additional documentation,
       *  see <a href="http://algs4.cs.princeton.edu/53substring">Section 5.3</a> of
       *  <i>Algorithms, 4th Edition</i> by Robert Sedgewick and Kevin Wayne.
       */
      public class BoyerMoore {
          private final int R;     // the radix
          private int[] right;     // the bad-character skip array
      
          private char[] pattern;  // store the pattern as a character array
          private String pat;      // or as a string
      
          /**
           * Preprocesses the pattern string.
           *
           * @param pat the pattern string
           */
          public BoyerMoore(String pat) {
              this.R = 256;
              this.pat = pat;
      
              // position of rightmost occurrence of c in the pattern
              right = new int[R];
              for (int c = 0; c < R; c++)
                  right[c] = -1;
              for (int j = 0; j < pat.length(); j++)
                  right[pat.charAt(j)] = j;
          }
      
          /**
           * Preprocesses the pattern string.
           *
           * @param pattern the pattern string
           * @param R the alphabet size
           */
          public BoyerMoore(char[] pattern, int R) {
              this.R = R;
              this.pattern = new char[pattern.length];
              for (int j = 0; j < pattern.length; j++)
                  this.pattern[j] = pattern[j];
      
              // position of rightmost occurrence of c in the pattern
              right = new int[R];
              for (int c = 0; c < R; c++)
                  right[c] = -1;
              for (int j = 0; j < pattern.length; j++)
                  right[pattern[j]] = j;
          }
      
          /**
           * Returns the index of the first occurrence of the pattern string
           * in the text string.
           *
           * @param  txt the text string
           * @return the index of the first occurrence of the pattern string
           *         in the text string; N if no such match
           */
          public int search(String txt) {
              int M = pat.length();
              int N = txt.length();
              int skip;
              for (int i = 0; i <= N - M; i += skip) {
                  skip = 0;
                  for (int j = M-1; j >= 0; j--) {
                      if (pat.charAt(j) != txt.charAt(i+j)) {
                          skip = Math.max(1, j - right[txt.charAt(i+j)]);
                          break;
                      }
                  }
                  if (skip == 0) return i;    // found
              }
              return N;                       // not found
          }
      
      
          /**
           * Returns the index of the first occurrence of the pattern string
           * in the text string.
           *
           * @param  text the text string
           * @return the index of the first occurrence of the pattern string
           *         in the text string; N if no such match
           */
          public int search(char[] text) {
              int M = pattern.length;
              int N = text.length;
              int skip;
              for (int i = 0; i <= N - M; i += skip) {
                  skip = 0;
                  for (int j = M-1; j >= 0; j--) {
                      if (pattern[j] != text[i+j]) {
                          skip = Math.max(1, j - right[text[i+j]]);
                          break;
                      }
                  }
                  if (skip == 0) return i;    // found
              }
              return N;                       // not found
          }
      
      
          /**
           * Takes a pattern string and an input string as command-line arguments;
           * searches for the pattern string in the text string; and prints
           * the first occurrence of the pattern string in the text string.
           */
          public static void main(String[] args) {
              String pat = args[0];
              String txt = args[1];
              char[] pattern = pat.toCharArray();
              char[] text    = txt.toCharArray();
      
              BoyerMoore boyermoore1 = new BoyerMoore(pat);
              BoyerMoore boyermoore2 = new BoyerMoore(pattern, 256);
              int offset1 = boyermoore1.search(txt);
              int offset2 = boyermoore2.search(text);
      
              // print results
              StdOut.println("text:    " + txt);
      
              StdOut.print("pattern: ");
              for (int i = 0; i < offset1; i++)
                  StdOut.print(" ");
              StdOut.println(pat);
      
              StdOut.print("pattern: ");
              for (int i = 0; i < offset2; i++)
                  StdOut.print(" ");
              StdOut.println(pat);
          }
      }
      
      
      Copyright © 2002–2015, Robert Sedgewick and Kevin Wayne.
      Last updated: Sat Aug 29 11:16:30 EDT 2015.
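As with KMP, the BoyerMoore class above stops at the first match. A possible adaptation that keeps scanning and counts every match is sketched here. It reuses the same bad-character rule but is not part of the algs4 code: the class name BoyerMooreCounter is hypothetical, and the skip table is a HashMap (an assumption made so that characters outside a 256-entry radix do not cause an out-of-bounds access).

```java
import java.util.HashMap;
import java.util.Map;

class BoyerMooreCounter {

    // Counts all occurrences of pat in text using only the bad-character rule.
    static int countOccurrences(String text, String pat) {
        int m = pat.length();
        int n = text.length();
        if (m == 0) return 0;

        // right.get(c) = index of the rightmost occurrence of c in the pattern
        Map<Character, Integer> right = new HashMap<>();
        for (int j = 0; j < m; j++) right.put(pat.charAt(j), j);

        int count = 0;
        int i = 0;                                // current alignment of the pattern in the text
        while (i <= n - m) {
            int skip = 0;
            for (int j = m - 1; j >= 0; j--) {    // compare right to left
                char c = text.charAt(i + j);
                if (pat.charAt(j) != c) {
                    skip = Math.max(1, j - right.getOrDefault(c, -1));
                    break;
                }
            }
            if (skip == 0) {                      // no mismatch: the pattern occurs at i
                count++;
                skip = 1;                         // shift by one so overlapping matches are counted
            }
            i += skip;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Get a placement right today. This is a top job opportunity. Apply for placement now";
        System.out.println("placement " + countOccurrences(text, "placement"));              // prints "placement 2"
        System.out.println("job opportunity " + countOccurrences(text, "job opportunity"));  // prints "job opportunity 1"
    }
}
```

Shifting by a single position after a match is the simplest safe choice; it gives up some of the skipping that makes Boyer Moore fast when matches are frequent, but it never misses an occurrence.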