Question

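I want to read a 150 MB text file and split its contents into words. When I do this using MappedByteBuffer, it takes 12 seconds for a file of 135 MB. When I do the same thing with BufferedReader, it takes even longer. Is it possible to reduce the time?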

This is my code.

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.ConcurrentHashMap;


public class mappedcompare {

    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        long one =System.currentTimeMillis();
        String line=null;



        File f= new File("D:\\dinesh\\janani.txt");
        FileInputStream fin = new FileInputStream(f);
        FileChannel fc = fin.getChannel();
        MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0L, fc.size());
        String[] words=null;
        ConcurrentHashMap <String,Integer> dictionary=new ConcurrentHashMap<String,Integer>(50,1);
        byte[] buffer = new byte[(int) fc.size()];
        mbb.get(buffer);
        ByteArrayInputStream isr = new ByteArrayInputStream(buffer);
        InputStreamReader ip = new InputStreamReader(isr);
        BufferedReader br = new BufferedReader(ip);
        while((line=br.readLine())!=null){
            line=line.replace(':', ' ');
            line=line.replace(';', ' ');
            line=line.replace('"', ' ');
            line=line.replace('!', ' ');
            line=line.replace(',',' ');
            line=line.replace('.', ' ');
            line =line.replace('/', ' ');
            line=line.replace('\\', ' ');
            line=line.replace('%', ' ');
            line=line.replace('(', ' ');
            line=line.replace(')', ' ');
            line=line.replace('\'', ' ');
        for(String word: line.split("\\s+"))
                {
            dictionary.putIfAbsent(word, 1);

            if(dictionary.containsKey("word")){
                    int value =dictionary.get(word);
                    dictionary.replace(word, ++value);  
                }

                }
        }
    System.out.println(System.currentTimeMillis() - one);
    fin.close();

    }

}

Answer 1

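First of all, don't use a ConcurrentHashMap in a single-threaded operation. It offers no benefit over a simple HashMap here. In Java 7, HashMap doesn't provide operations like putIfAbsent, but that is not a limitation. It is rather a chance to clean up the Map update code: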

dictionary.putIfAbsent(word, 1);

if(dictionary.containsKey("word")){
        int value =dictionary.get(word);
        dictionary.replace(word, ++value);  
    }

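Here you are performing four hash lookup operations (putIfAbsent, containsKey, get and replace) where you actually need only two. Besides that, looking up the literal "word" instead of the variable word looks broken to me: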

Integer old=dictionary.get(word);
dictionary.put(word, old==null? 1: old+1);

This needs only two lookups and works with an ordinary HashMap.
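If Java 8 or later is available, the same update can be written as a single call to Map.merge, which HashMap implements directly. The following is only a sketch under that assumption; the names WordTally and tally are made up for illustration and do not appear in the question.

```java
import java.util.HashMap;
import java.util.Map;

class WordTally {
    // Counts how often each word occurs. merge() stores 1 for a new key
    // and otherwise combines the old value with 1 using Integer::sum.
    static Map<String, Integer> tally(String[] words) {
        Map<String, Integer> dictionary = new HashMap<>();
        for (String word : words) {
            dictionary.merge(word, 1, Integer::sum);
        }
        return dictionary;
    }

    public static void main(String[] args) {
        System.out.println(tally(new String[] {"a", "b", "a"}));
    }
}
```

On Java 7 the get/put pair shown above remains the way to go.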

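Next, get rid of that sequence of line=line.replace(…, ' '); invocations. Every one of them creates a new String, while all you really want is for these special characters to be treated like ' ' in the split operation. So you can simply adapt the split operation to treat these characters as delimiters: for(String word: line.split("[:;\"!,./\\\\%()'\\s]+")).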
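To see what that character class does on a concrete line, here is a small sketch. It compiles the pattern once, since String.split with a non-trivial regex compiles it on each call, and it drops the empty leading token that split produces when a line starts with a delimiter. The class name DelimiterSplit and the method words are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DelimiterSplit {
    // Same character class as above, compiled once instead of once per line.
    static final Pattern DELIMITERS = Pattern.compile("[:;\"!,./\\\\%()'\\s]+");

    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        for (String word : DELIMITERS.split(line)) {
            // split() yields an empty first element if the line starts with a delimiter
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Hello, world, it, s, 50]
        System.out.println(words("Hello, world! (it's 50%)"));
    }
}
```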

Putting it all together, your code becomes far more readable, which is a bigger win than the few seconds you might save.

File f= new File("D:\\dinesh\\janani.txt");
try(FileInputStream fin = new FileInputStream(f);
    FileChannel fc = fin.getChannel();) {
  final MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0L, fc.size());
  HashMap<String, Integer> dictionary=new HashMap<>();
  byte[] buffer = new byte[(int) fc.size()];
  mbb.get(buffer);
  ByteArrayInputStream isr = new ByteArrayInputStream(buffer);
  InputStreamReader ip = new InputStreamReader(isr);
  BufferedReader br = new BufferedReader(ip);
  String line;
  while((line=br.readLine())!=null){
    for(String word: line.split("[:;\"!,./\\\\%()'\\s]+")) {
      Integer old=dictionary.get(word);
      dictionary.put(word, old==null? 1: old+1);
    }
  }
}

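Finally, I recommend giving Files.readAllLines(…) a try. Whether it is faster depends on the environment, but even if it is slightly slower, I would prefer it over your MappedByteBuffer approach because of the readability win: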

File f= new File("D:\\dinesh\\janani.txt");
HashMap<String, Integer> dictionary=new HashMap<>();
for(String line:Files.readAllLines(f.toPath(), Charset.defaultCharset())) {
  for(String word: line.split("[:;\"!,./\\\\%()'\\s]+")) {
    Integer old=dictionary.get(word);
    dictionary.put(word, old==null? 1: old+1);
  }
}
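Files.readAllLines keeps every line of the file in memory at once. If that is a concern for a file of this size, a line-by-line variant may be worth measuring. This is only a sketch: the class name StreamingWordCount, the count method and the use of Map.merge (Java 8+) are illustrative choices, not something taken from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class StreamingWordCount {
    static final Pattern DELIMITERS = Pattern.compile("[:;\"!,./\\\\%()'\\s]+");

    // Reads the input one line at a time, so only the current line and the
    // dictionary are held in memory.
    static Map<String, Integer> count(BufferedReader reader) {
        Map<String, Integer> dictionary = new HashMap<>();
        reader.lines().forEach(line -> {
            for (String word : DELIMITERS.split(line)) {
                if (!word.isEmpty()) {
                    dictionary.merge(word, 1, Integer::sum);
                }
            }
        });
        return dictionary;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: StreamingWordCount <file>");
            return;
        }
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), Charset.defaultCharset())) {
            System.out.println(count(reader).size() + " distinct words");
        }
    }
}
```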

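If performance really is that important, you can go one level deeper: do the splitting manually at the byte level and create a String only once a word has been found. This assumes that your encoding uses one byte per char and maps the lower values (i.e. the ASCII characters) directly, which is the case for common encodings like Windows CP1258.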

HashMap<String, Integer> dictionary=new HashMap<>();
final CharsetDecoder cs = Charset.defaultCharset().newDecoder();
assert cs.averageCharsPerByte()==1;
try(FileChannel ch=FileChannel.open(f.toPath(), StandardOpenOption.READ)) {
  MappedByteBuffer mbb=ch.map(MapMode.READ_ONLY, 0, ch.size());
  ByteBuffer slice=mbb.asReadOnlyBuffer();
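  int start=0;
  while(mbb.hasRemaining()) {
    switch(mbb.get()) {
      case ' ': case   9: case   10: case  11: case  13: case '\f':
      case ':': case ';': case '\\': case '"': case '!': case ',':
      case '.': case '/': case  '%': case '(': case ')': case '\'':
        int end=mbb.position()-1; // index of the delimiter just read
        if(end>start) {
          // decode only the bytes of the word, excluding the delimiter
          slice.limit(end).position(start);
          String word=cs.decode(slice).toString();
          Integer old=dictionary.get(word);
          dictionary.put(word, old==null? 1: old+1);
        }
        start=end+1; // the next word starts right after the delimiter
    }
  }
  // count a final word that is not followed by a delimiter
  if(mbb.position()>start) {
    slice.limit(mbb.position()).position(start);
    String word=cs.decode(slice).toString();
    Integer old=dictionary.get(word);
    dictionary.put(word, old==null? 1: old+1);
  }
}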

This can speed up such a low-level operation significantly, at the cost of not being entirely portable.
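The byte-level idea can be tried without a file by running it over any ByteBuffer. Below is a minimal sketch under stated assumptions: it decodes with ISO-8859-1 as a stand-in for a single-byte encoding, uses absolute get(int) calls instead of a second buffer view, and relies on Map.merge (Java 8+). The names ByteLevelCount, isDelimiter and count are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class ByteLevelCount {
    static boolean isDelimiter(byte b) {
        switch (b) {
            case ' ': case '\t': case '\n': case 11: case '\r': case '\f':
            case ':': case ';': case '\\': case '"': case '!': case ',':
            case '.': case '/': case '%': case '(': case ')': case '\'':
                return true;
            default:
                return false;
        }
    }

    // Scans the bytes between position and limit; a String is created only
    // once a complete word has been found.
    static Map<String, Integer> count(ByteBuffer data) {
        Map<String, Integer> dictionary = new HashMap<>();
        int start = data.position();
        int limit = data.limit();
        for (int i = start; i <= limit; i++) {
            // the end of the buffer is treated like a delimiter so the last word counts
            if (i == limit || isDelimiter(data.get(i))) {
                if (i > start) {
                    byte[] raw = new byte[i - start];
                    for (int k = 0; k < raw.length; k++) {
                        raw[k] = data.get(start + k);
                    }
                    // assumption: a single-byte encoding, here ISO-8859-1
                    String word = new String(raw, StandardCharsets.ISO_8859_1);
                    dictionary.merge(word, 1, Integer::sum);
                }
                start = i + 1;
            }
        }
        return dictionary;
    }

    public static void main(String[] args) {
        byte[] text = "to be, or not to be".getBytes(StandardCharsets.US_ASCII);
        System.out.println(count(ByteBuffer.wrap(text)));
    }
}
```

Because a MappedByteBuffer is a ByteBuffer, the same method should accept the mapped file directly.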

Answer 2

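I tried to reduce the number of operations as much as possible. For the sample file I created, this ended up about 3 times faster than the original code. It probably won't work with most of the more complex character encodings (see Holger's answer for an alternative approach that should work with any character encoding).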

long one = System.currentTimeMillis();

boolean[] isDelimiter = new boolean[127];
isDelimiter[' '] = true;
isDelimiter['\t'] = true;
isDelimiter[':'] = true;
isDelimiter[';'] = true;
isDelimiter['"'] = true;
isDelimiter['!'] = true;
isDelimiter[','] = true;
isDelimiter['.'] = true;
isDelimiter['/'] = true;
isDelimiter['\\'] = true;
isDelimiter['%'] = true;
isDelimiter['('] = true;
isDelimiter[')'] = true;
isDelimiter['\''] = true;
isDelimiter['\r'] = true;
isDelimiter['\n'] = true;

class Counter {

  int count = 0;
}

File f = // your file here
FileInputStream fin = new FileInputStream(f);
FileChannel fc = fin.getChannel();
MappedByteBuffer mbb = fc
    .map(FileChannel.MapMode.READ_ONLY, 0L, f.length());
Map<String, Counter> dictionary = new HashMap<String, Counter>();

StringBuilder wordBuilder = new StringBuilder();
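while (mbb.hasRemaining()) {
  // mask to 0..255 so that bytes >= 0x80 do not sign-extend into unrelated chars
  char c = (char) (mbb.get() & 0xFF);
  if (c < isDelimiter.length && isDelimiter[c]) {
    if (wordBuilder.length() > 0) {
      String word = wordBuilder.toString();
      wordBuilder.setLength(0);

      Counter intForWord = dictionary.get(word);
      if (intForWord == null) {
        intForWord = new Counter();
        dictionary.put(word, intForWord);
      }
      intForWord.count++;
    }
  } else {
    wordBuilder.append(c);
  }
}
// count a final word that is not followed by a delimiter
if (wordBuilder.length() > 0) {
  String word = wordBuilder.toString();
  Counter intForWord = dictionary.get(word);
  if (intForWord == null) {
    intForWord = new Counter();
    dictionary.put(word, intForWord);
  }
  intForWord.count++;
}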

System.out.println(System.currentTimeMillis() - one);
fin.close();
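One detail matters when a byte is turned into a char as in the loop above: Java bytes are signed, so a byte of 0x80 or higher cast directly to char ends up far outside the 0..255 range, while masking with 0xFF keeps it there (which matches ISO-8859-1). A tiny sketch to illustrate; the class and method names are made up for this example.

```java
class ByteToCharDemo {
    // Direct cast: the byte is sign-extended before being narrowed to char.
    static char direct(byte b) {
        return (char) b;
    }

    // Masked cast: the byte is treated as an unsigned value in 0..255.
    static char masked(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE9; // 'é' in ISO-8859-1
        System.out.println((int) direct(b)); // prints 65513
        System.out.println((int) masked(b)); // prints 233
    }
}
```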

Answer 3

Try replacing all of the replace and split calls with a single split:

line.split("[:;\"!,./\\\\%()'\\s]+")

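You could also try using Java's Scanner to parse the file as you stream it. You can pass the above regular expression to useDelimiter so that it splits on all of those characters.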
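A rough sketch of the Scanner variant, assuming Java 8+ for Map.merge. The class name ScannerWordCount and the count method are hypothetical, and whether this is faster than the split-based version would have to be measured.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Pattern;

class ScannerWordCount {
    static final Pattern DELIMITERS = Pattern.compile("[:;\"!,./\\\\%()'\\s]+");

    // Pulls one token at a time from the source, using the delimiter
    // pattern in place of Scanner's default whitespace delimiter.
    static Map<String, Integer> count(Readable source) {
        Map<String, Integer> dictionary = new HashMap<>();
        try (Scanner scanner = new Scanner(source)) {
            scanner.useDelimiter(DELIMITERS);
            while (scanner.hasNext()) {
                dictionary.merge(scanner.next(), 1, Integer::sum);
            }
        }
        return dictionary;
    }

    public static void main(String[] args) {
        System.out.println(count(new java.io.StringReader("one, two; one")));
    }
}
```

For a file, a reader such as the one returned by Files.newBufferedReader can be passed in as the source.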
