Question

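I have this code that determines whether a word (ignoring case) is contained in the wordList text file. However, the wordList text file may have 65000++ lines, and searching for a single word with my implementation below takes nearly a minute. Can you think of a better implementation?

Thanks!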

import java.io.*;
import java.util.*;

public class WordSearch 
{
    LinkedList<String> lxx;
    FileReader fxx;
    BufferedReader bxx;

    public WordSearch(String wordlist) 
        throws IOException
    {
        fxx = new FileReader(wordlist);
        bxx = new BufferedReader(fxx);
        lxx = new LinkedList<String>();
        String word;

        while ( (word = bxx.readLine()) != null) 
        {
            lxx.add(word);
        }

        bxx.close();
    }

    public boolean inTheList (String theWord)
    {
        for(int i = 0; i < lxx.size(); i++)
        {
            if (theWord.compareToIgnoreCase(lxx.get(i)) == 0)
            {
                return true;
            }
        }

        return false;
    }
}

Answer 1

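Use a HashSet and put the lowercase version of each word into it. Checking whether a HashSet contains a given string is, on average, a constant-time (read: extremely fast) operation.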

Answer 2

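Since you're searching, you may want to consider sorting the list before searching; then you can do a binary search, which is much faster than a linear search. That helps if you perform multiple searches on the same list; otherwise the cost of sorting the list isn't worth it for a single search.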
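A minimal sketch of the sort-once, binary-search-many-times idea, assuming the words are already in memory. The class name SortedWordList and the sample words are hypothetical; the words are copied into an ArrayList because binary search needs cheap random access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class SortedWordList {
    private final List<String> words;

    // Copy into an ArrayList (O(1) random access), lowercase every word, sort once.
    public SortedWordList(List<String> rawWords) {
        words = new ArrayList<String>(rawWords.size());
        for (String w : rawWords) {
            words.add(w.toLowerCase(Locale.ROOT));
        }
        Collections.sort(words);
    }

    // Each lookup is O(log n); binarySearch returns a negative value on a miss.
    public boolean inTheList(String theWord) {
        return Collections.binarySearch(words, theWord.toLowerCase(Locale.ROOT)) >= 0;
    }

    public static void main(String[] args) {
        SortedWordList list =
            new SortedWordList(Arrays.asList("Pear", "apple", "Zebra", "mango"));
        System.out.println(list.inTheList("APPLE"));
        System.out.println(list.inTheList("kiwi"));
    }
}
```

The sort costs O(n log n) up front, so this only pays off when the same list is searched repeatedly.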

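Also, doing a linear search on a linked list using "lxx.get(i)" is asking for trouble. LinkedList.get() is O(n). You can either use an Iterator (the easy way: for (String s : lxx)) or switch to a list type with O(1) access time, such as ArrayList.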

Answer 3

You're searching the list in an O(n) operation each time, so this will get quite costly when you have thousands of words. Instead, use a HashSet:

Set<String> lxx;

...

lxx = new HashSet<String>();
while ( (word = bxx.readLine()) != null) {
    lxx.add(word.toLowerCase());
}
bxx.close();

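Then use lxx.contains(theWord.toLowerCase()) to check whether the word is in the file. Each lookup in the HashSet is an O(1) operation on average, so the time it takes is (almost) independent of the size of the file.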
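For reference, here is a self-contained sketch of the same HashSet approach. The class name WordSet and the Reader-based constructor are assumptions made so the example runs without a file on disk. Passing Locale.ROOT to toLowerCase is also an addition here: it keeps the lowercasing independent of the machine's default locale.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class WordSet {
    private final Set<String> words = new HashSet<String>();

    // Reads one word per line; try-with-resources closes the reader even on error.
    public WordSet(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                words.add(line.toLowerCase(Locale.ROOT));
            }
        }
    }

    // Average O(1) lookup, regardless of how many words were loaded.
    public boolean inTheList(String theWord) {
        return words.contains(theWord.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for new FileReader(wordlist).
        WordSet set = new WordSet(new StringReader("Apple\nbanana\nCherry\n"));
        System.out.println(set.inTheList("CHERRY"));
        System.out.println(set.inTheList("durian"));
    }
}
```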

Answer 4

First off, don't declare your variable as a LinkedList; declare it as a List (the parts of the code unrelated to the list have been removed):

public class WordSearch 
{
    List<String> lxx;

    public WordSearch(String wordlist) 
        throws IOException
    {
        lxx = new LinkedList<String>();
    }
}

Next, don't call get on the list; get on a LinkedList is very slow. Use an iterator instead, or better yet the new-style for loop, which uses an iterator for you:

    public boolean inTheList (String theWord)
    {
        for(String word : lxx)
        {
            if (theWord.compareToIgnoreCase(word) == 0)
            {
                return true;
            }
        }

        return false;
    }

Next, change new LinkedList to new ArrayList:

lxx = new ArrayList<String>();

This code should be faster, but you can still do better.

Since you don't care about duplicate words, use a Set instead of a List, and a HashSet instead of an ArrayList.

Doing so will speed the program up significantly.

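The original code, using a LinkedList with get, has to start over at the beginning of the list every time it looks up the next word. Using the Iterator (via the new-style for-each loop) stops that from happening.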

Using a LinkedList means that every step to the next word in the list involves following a link; an ArrayList doesn't have that overhead.

A HashSet is backed by a hash table, which gives very fast (average constant-time) lookups.
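To make the cost of get(i) in a loop concrete, here is a small illustrative sketch. TraversalCost is a hypothetical, simplified singly linked list that counts how many nodes it steps over; it is not java.util.LinkedList, which is doubly linked and walks from whichever end is nearer, so its real cost is roughly half of what this toy shows. The growth is still quadratic.

```java
public class TraversalCost {
    // One node of a minimal singly linked list.
    static class Node {
        final String value;
        Node next;
        Node(String value) { this.value = value; }
    }

    private Node head;
    private Node tail;
    private int size;
    long hops; // number of nodes stepped over or visited during a scan

    void add(String value) {
        Node n = new Node(value);
        if (head == null) { head = n; } else { tail.next = n; }
        tail = n;
        size++;
    }

    // Like get(i) on a linked list: always walks from the head.
    String get(int index) {
        Node cur = head;
        for (int i = 0; i < index; i++) {
            cur = cur.next;
            hops++;
        }
        return cur.value;
    }

    // Full scan with get(i), as in the original inTheList.
    long hopsWithIndexedGet() {
        hops = 0;
        for (int i = 0; i < size; i++) {
            get(i);
        }
        return hops;
    }

    // Full scan that follows the links once, which is what an iterator does.
    long hopsWithIterator() {
        hops = 0;
        for (Node cur = head; cur != null; cur = cur.next) {
            hops++;
        }
        return hops;
    }

    public static void main(String[] args) {
        TraversalCost list = new TraversalCost();
        for (int i = 0; i < 1000; i++) {
            list.add("word" + i);
        }
        System.out.println(list.hopsWithIndexedGet()); // 0 + 1 + ... + 999 = 499500
        System.out.println(list.hopsWithIterator());   // 1000
    }
}
```

With 65000 words the indexed scan in this toy model would step over about 2.1 billion nodes for a single failed search, against 65000 for the iterator.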

Answer 5

Here is my implementation, which searches in under 50 ms.

First, you have to load the file and keep it in memory.

You can load it however you like, but it is easier if you load it in big chunks.

My input was the byte into python book (downloaded the single-file HTML version) and the Java language specification (downloaded the HTML and created a single file out of all the HTML pages).

To turn the input into one big word-list file, I used this same program (see the commented-out code).

Once I had a big file with about 300k words, I ran the program, with this output:

C:\Users\oreyes\langs\java\search>dir singlelineInput.txt
 El volumen de la unidad C no tiene etiqueta.
 El número de serie del volumen es: 22A8-203B

 Directorio de C:\Users\oreyes\langs\java\search

04/03/2011  09:37 p.m.         3,898,345 singlelineInput.txt
               1 archivos      3,898,345 bytes

C:\Users\oreyes\langs\java\search>javac WordSearch.java

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "great"
Loaded 377381 words in 2844 ms
true
in 31 ms

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "great"
Loaded 377381 words in 2812 ms
true
in 31 ms

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "awesome"
Loaded 377381 words in 2813 ms
false
in 47 ms

C:\Users\oreyes\langs\java\search>gvim singlelineInput.txt

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "during"
Loaded 377381 words in 2813 ms
true
in 15 ms

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "specification"
Loaded 377381 words in 2875 ms
true
in 47 ms

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "<href"
Loaded 377381 words in 2844 ms
false
in 47 ms

C:\Users\oreyes\langs\java\search>java WordSearch singlelineInput.txt "<br>"
Loaded 377381 words in 2829 ms
true
in 15 ms

Always under 50 ms.

Here is the code:

   import java.io.*;
   import java.util.*;

   class WordSearch {
       String inputFile;
       List<String> words;
       public WordSearch(String file ) { 
           inputFile = file;
       }
       public void initialize() throws IOException { 
           long start = System.currentTimeMillis();
           File file = new File( inputFile );
           ByteArrayOutputStream baos = new ByteArrayOutputStream(( int ) file.length());
           // try-with-resources closes the stream once the file is copied into memory
           try ( FileInputStream in = new FileInputStream( file ) ) {
               copyLarge( in, baos, (int)file.length() );
           }

           Scanner scanner = new Scanner( new ByteArrayInputStream(  baos.toByteArray() ));
           words = new LinkedList<String>();
           while( scanner.hasNextLine() ) { 
              String l = scanner.nextLine().trim();
              //for( String s : l.split("\\s+")){
                //System.out.println( s );
                words.add( l.toLowerCase() );
              //}
           }

           Collections.sort( words );
           for( String s : words ) { 
               //System.out.println( s );
           }
           System.out.println("Loaded " + words.size() + " words in "+  ( System.currentTimeMillis() - start ) + " ms"  );
       }

       public boolean contains( String aWord ) { 
           return words.contains( aWord.toLowerCase() );
       }
        // taken from:  http://stackoverflow.com/questions/326390/how-to-create-a-java-string-from-the-contents-of-a-file/326413#326413 
        public static long copyLarge(InputStream input, OutputStream output, int size )
               throws IOException {
           // at least 1 byte: read() on a zero-length buffer returns 0 forever (empty file)
           byte[] buffer = new byte[Math.max( size, 1 )];// something biggie 
           long count = 0;
           int n = 0;
           while (-1 != (n = input.read(buffer))) {
               output.write(buffer, 0, n);
               count += n;
           }
           return count;
       }
       public static void main( String ... args ) throws IOException  { 
           WordSearch ws = new WordSearch( args[0] );
           ws.initialize();
           long start = System.currentTimeMillis();
           System.out.println( ws.contains( args[1] ) );
           System.out.println("in "+  ( System.currentTimeMillis() - start ) +" ms ");

       }
    }

The hard part was getting the sample input :P

Answer 6

Guess what, using a HashSet returns immediately:

Here is the modified version; the lookup always reports 0 ms.

   import java.io.*;
   import java.util.*;

   class WordSearch {
       String inputFile;
       //List<String> words;
       Set<String> words;
       public WordSearch(String file ) { 
           inputFile = file;
       }
       public void initialize() throws IOException { 
           long start = System.currentTimeMillis();
           File file = new File( inputFile );
           ByteArrayOutputStream baos = new ByteArrayOutputStream(( int ) file.length());
           // try-with-resources closes the stream once the file is copied into memory
           try ( FileInputStream in = new FileInputStream( file ) ) {
               copyLarge( in, baos, (int)file.length() );
           }

           Scanner scanner = new Scanner( new ByteArrayInputStream(  baos.toByteArray() ));
           words = new HashSet<String>();
           while( scanner.hasNextLine() ) { 
              String l = scanner.nextLine().trim();
              //for( String s : l.split("\\s+")){
                //System.out.println( s );
                words.add( l.toLowerCase() );
              //}
           }

           //Collections.sort( words );
           for( String s : words ) { 
               System.out.println( s );
           }
           System.out.println("Loaded " + words.size() + " words in "+  ( System.currentTimeMillis() - start ) + " ms"  );
       }

       public boolean contains( String aWord ) { 
           return words.contains( aWord.toLowerCase() );
       }

        public static long copyLarge(InputStream input, OutputStream output, int size )
               throws IOException {
           // at least 1 byte: read() on a zero-length buffer returns 0 forever (empty file)
           byte[] buffer = new byte[Math.max( size, 1 )];// something biggie 
           long count = 0;
           int n = 0;
           while (-1 != (n = input.read(buffer))) {
               output.write(buffer, 0, n);
               count += n;
           }
           return count;
       }
       public static void main( String ... args ) throws IOException  { 
           WordSearch ws = new WordSearch( args[0] );
           ws.initialize();
           long start = System.currentTimeMillis();
           System.out.println( ws.contains( args[1] ) );
           System.out.println("in "+  ( System.currentTimeMillis() - start ) +" ms ");

       }
    }

Now I know for sure :)

Answer 7

Two suggestions: both of these data structures can give better performance.

Directed Acyclic Word Graph (DAWG)
Dictionary data structure, tree-based (for example a trie)
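As a rough illustration of the tree-shaped dictionary idea, here is a minimal trie sketch. It is not a DAWG: a DAWG additionally merges shared suffixes to save memory, which is not shown here. The class name WordTrie and the HashMap-per-node layout are assumptions chosen for brevity.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class WordTrie {
    // Each node maps the next character to a child node.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<Character, Node>();
        boolean isWord; // true if a stored word ends exactly at this node
    }

    private final Node root = new Node();

    // Walk down the tree one character at a time, creating nodes as needed.
    public void add(String word) {
        Node cur = root;
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            Node next = cur.children.get(c);
            if (next == null) {
                next = new Node();
                cur.children.put(c, next);
            }
            cur = next;
        }
        cur.isWord = true;
    }

    // Lookup cost depends on the length of the word, not on how many words are stored.
    public boolean contains(String word) {
        Node cur = root;
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) {
                return false;
            }
        }
        return cur.isWord;
    }

    public static void main(String[] args) {
        WordTrie trie = new WordTrie();
        trie.add("great");
        trie.add("greater");
        System.out.println(trie.contains("GREAT"));
        System.out.println(trie.contains("gre"));
    }
}
```

For a plain membership test a HashSet is usually simpler; a trie becomes more attractive when prefix queries are needed as well.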

Faster data structure for searching a string
