Question

Is there a better [pre-existing, optionally Java 1.6] solution than writing a streaming file reader class that meets the following criteria?

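Given an ASCII file of arbitrary size, where each line is terminated by \n
For each call to some method readLine(), read a random line from the file
For the lifetime of the file handle, no call to readLine() should return the same line twice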

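Update

All lines must eventually be read.

Context: the file's contents are produced by a Unix shell command that lists every path under a given directory; there are millions to a billion files (yielding millions to a billion lines in the target file). If there is some way to write the paths to the file in random order at creation time, that is an acceptable solution as well.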

Answer 1

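To avoid reading in the whole file, which may not be possible in your case, you may want to use a RandomAccessFile instead of a standard Java FileInputStream. With RandomAccessFile, you can use the seek(long position) method to jump to an arbitrary place in the file and start reading there. The code would look something like this.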

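import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Open read-only; "rw" would needlessly require write permission.
RandomAccessFile raf = new RandomAccessFile("path-to-file", "r");
// Key on the line's starting offset: a hashCode() key would treat two
// different lines with colliding hashes as the same line.
Map<Long, String> sampledLines = new HashMap<Long, String>();
// numberOfRandomSamples must not exceed the number of lines, or this never ends
while (sampledLines.size() < numberOfRandomSamples)
{
    //seek to a random point in the file
    raf.seek((long) (Math.random() * raf.length()));

    //skip from the random location to the beginning of the next line
    int nextByte = raf.read();
    while (nextByte != -1 && nextByte != '\n')
        nextByte = raf.read();
    //wrap around to the first line if that skip reached the end of the file
    if (nextByte == -1 || raf.getFilePointer() == raf.length())
        raf.seek(0);
    long lineStart = raf.getFilePointer();

    //read the line into a buffer
    StringBuilder lineBuffer = new StringBuilder();
    nextByte = raf.read();
    while (nextByte != -1 && nextByte != '\n')
    {
        lineBuffer.append((char) nextByte);
        nextByte = raf.read(); //advance, otherwise this loop never terminates
    }

    //ensure uniqueness: an offset that was already sampled is simply retried
    if (!sampledLines.containsKey(lineStart))
        sampledLines.put(lineStart, lineBuffer.toString());
}
raf.close();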

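Here, sampledLines should hold the randomly selected lines at the end. You may also need to check that you haven't randomly skipped past the end of the file, to avoid an error there.

Edit: I made it wrap around to the beginning of the file in case you reach the end. It was a pretty simple check.

Edit 2: I made it verify the uniqueness of lines by using a HashMap.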

Answer 2

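Preprocess the input file and remember the offset of every new line. Use a BitSet to keep track of the lines that have already been used. If you want to save some memory, remember only the offset of every 16th line; it is still easy to jump into the file and do a sequential lookup within a block of 16 lines.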
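A minimal sketch of this idea, under a few assumptions of mine: the class name SparseIndexLineReader, the STRIDE constant, and the hasNext()/readLine() methods are all hypothetical names, and the way an unused line is picked (a random start followed by BitSet.nextClearBit, wrapping around) is a simplification that is not perfectly uniform. A BitSet is indexed by int, so this tops out at about 2^31 lines, and at that scale a long[] would be a leaner index than a List<Long>.

```java
import java.io.BufferedInputStream;
import java.io.Closeable;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Random;

final class SparseIndexLineReader implements Closeable {
    private static final int STRIDE = 16;       // remember every 16th line offset
    private final List<Long> blockOffsets = new ArrayList<Long>();
    private final BitSet used = new BitSet();   // one bit per line already returned
    private final Random random = new Random();
    private final RandomAccessFile file;
    private int lineCount = 0;
    private int remaining;

    SparseIndexLineReader(String path) throws IOException {
        // Preprocessing pass: one sequential scan that records block offsets.
        InputStream in = new BufferedInputStream(new FileInputStream(path));
        try {
            long pos = 0;
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    if (lineCount % STRIDE == 0) blockOffsets.add(pos);
                    lineCount++;
                    atLineStart = false;
                }
                if (b == '\n') atLineStart = true;
                pos++;
            }
        } finally {
            in.close();
        }
        remaining = lineCount;
        file = new RandomAccessFile(path, "r");
    }

    boolean hasNext() {
        return remaining > 0;
    }

    String readLine() throws IOException {
        if (remaining == 0) throw new NoSuchElementException();
        // Random starting point, then the next line that has not been used yet.
        int line = used.nextClearBit(random.nextInt(lineCount));
        if (line >= lineCount) line = used.nextClearBit(0); // wrap around
        used.set(line);
        remaining--;
        // Jump to the start of the block, then skip forward inside it.
        file.seek(blockOffsets.get(line / STRIDE));
        for (int skip = line % STRIDE; skip > 0; skip--) file.readLine();
        return file.readLine();
    }

    public void close() throws IOException {
        file.close();
    }
}
```

Calling readLine() until hasNext() returns false visits every line exactly once. RandomAccessFile.readLine() is unbuffered, so each call costs up to 16 short sequential reads; that is the price of the smaller index.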

Answer 3

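Since you can pad the lines, I would do something along these lines. You should also note that, even then, there may be a limit on what a List can actually hold.

Picking a random number every time you want to read a line and adding it to a Set would also work, but this approach ensures that the file is read completely: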

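import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Random;

public class VeryLargeFileReading
    implements Iterator<String>, Closeable
{
    private static final Random RND = new Random();
    // Shuffled byte offsets of the lines that have not been returned yet
    final List<Long> indices = new ArrayList<Long>();
    final RandomAccessFile fd;

    // lineSize is the fixed length in bytes of every padded line,
    // including its trailing \n
    public VeryLargeFileReading(String fileName, long lineSize)
        throws IOException
    {
        // RandomAccessFile has no single-argument constructor; "r" is read-only
        fd = new RandomAccessFile(fileName, "r");
        long nrLines = fd.length() / lineSize;
        for (long i = 0; i < nrLines; i++)
            indices.add(i * lineSize);
        Collections.shuffle(indices, RND);
    }

    // Iterator methods
    @Override
    public boolean hasNext()
    {
        return !indices.isEmpty();
    }

    @Override
    public void remove()
    {
        // Nope
        throw new UnsupportedOperationException();
    }

    @Override
    public String next()
    {
        if (indices.isEmpty())
            throw new NoSuchElementException();
        // Take from the end: removing index 0 of an ArrayList shifts every element
        final long offset = indices.remove(indices.size() - 1);
        try
        {
            fd.seek(offset);
            // trim() strips the padding
            return fd.readLine().trim();
        }
        catch (IOException e)
        {
            // Iterator.next() cannot throw a checked exception
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close() throws IOException
    {
        fd.close();
    }
}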
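The class above relies on every line having the same length, so the file has to be padded first. As a rough sketch of that step (LinePadder and its pad method are hypothetical names, and padding with trailing spaces is my assumption, chosen because next() calls trim()), something like this could rewrite the listing with fixed-size lines:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

final class LinePadder {
    /**
     * Copies inPath to outPath, padding each line with spaces so that every
     * line occupies exactly lineSize bytes including the trailing \n.
     * Returns the number of lines written. Assumes ASCII content.
     */
    static long pad(String inPath, String outPath, int lineSize) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(inPath));
        BufferedWriter out = new BufferedWriter(new FileWriter(outPath));
        long count = 0;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.length() > lineSize - 1)
                    throw new IllegalArgumentException(
                        "line longer than " + (lineSize - 1) + " characters: " + line);
                out.write(line);
                for (int i = line.length(); i < lineSize - 1; i++)
                    out.write(' ');
                out.write('\n');
                count++;
            }
        } finally {
            in.close();
            out.close();
        }
        return count;
    }
}
```

With a padded file, line i starts at byte i * lineSize, which is exactly what the constructor computes. Note that lineSize has to cover the longest path in the listing, so the padded file can be much larger than the original, and that trim() would also alter a path that genuinely begins or ends with whitespace.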

Answer 4

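If the number of files is truly arbitrary, keeping track of the processed files could become a problem in terms of memory usage (or IO time, if you track them in files instead of a list or set). Solutions that keep a growing list of selected lines also run into timing-related issues.

I would consider something along the following lines:

Create n "bucket" files. n could be determined from something that takes the number of files and the system memory into account. (If n is large, you could generate a subset of n at a time to keep the number of open file handles down.)
Hash each file's name and write it into the corresponding bucket file, "splitting" the directory listing on an arbitrary criterion.
Read in the bucket file contents (just the file names) and process them as-is (the randomness being provided by the hashing mechanism), or pick rnd(n) and remove entries as you go, for more randomness.
Alternatively, you could pad the lines and use the random-access idea, removing indices/offsets from the list as they are picked.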
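The bucket steps can be sketched roughly as follows. This is an illustration under my own assumptions, not a tuned implementation: BucketShuffler and its parameters are hypothetical names, all n bucket files are kept open at once (so n has to stay small), String.hashCode() stands in for the hash, and each bucket is assumed to fit in memory so that it can be shuffled there.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class BucketShuffler {
    /** Copies the lines of inPath to outPath in a pseudo-random order. */
    static void shuffle(String inPath, String outPath, int bucketCount, Random rnd)
            throws IOException {
        // Step 1: create the n bucket files.
        File[] buckets = new File[bucketCount];
        BufferedWriter[] writers = new BufferedWriter[bucketCount];
        for (int i = 0; i < bucketCount; i++) {
            buckets[i] = File.createTempFile("bucket" + i + "-", ".txt");
            writers[i] = new BufferedWriter(new FileWriter(buckets[i]));
        }

        // Step 2: hash every path and append it to the matching bucket.
        BufferedReader in = new BufferedReader(new FileReader(inPath));
        String line;
        while ((line = in.readLine()) != null) {
            int b = (line.hashCode() & 0x7fffffff) % bucketCount;
            writers[b].write(line);
            writers[b].write('\n');
        }
        in.close();
        for (BufferedWriter w : writers) w.close();

        // Step 3: load one bucket at a time, shuffle it, append it to the output.
        BufferedWriter out = new BufferedWriter(new FileWriter(outPath));
        for (File bucket : buckets) {
            List<String> lines = new ArrayList<String>();
            BufferedReader r = new BufferedReader(new FileReader(bucket));
            while ((line = r.readLine()) != null) lines.add(line);
            r.close();
            Collections.shuffle(lines, rnd);
            for (String s : lines) {
                out.write(s);
                out.write('\n');
            }
            bucket.delete();
        }
        out.close();
    }
}
```

The output can then be read sequentially, which also covers the requirement that every line is eventually read exactly once. The order is not a uniform permutation, because the buckets are written out one after another; visiting the buckets in random order would mix it a little more.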

Java: ASCII random line file access with state
