Question

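I have a big file containing 30 million user IDs. The file has one user ID on each line.

Now I want to pick a random line from that big text file. I know the total number of user IDs in the file. I am not sure what the best way is to select random elements from it. I thought of storing all 30 million user IDs in a set and then picking elements at random from the HashSet, but that approach fails with an out-of-memory error.

So that is why I am trying to select random elements directly from the big text file.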

final String id = generateRandomUserId(random);

/**
 * Select a random element from a big text file
 *
 * @param r source of randomness
 * @return a randomly chosen user ID
 */
private String generateRandomUserId(Random r) {

     File bigFile = new File("C:\\bigfile.txt");

     //randomly select elements from a big text file         


}

What is the best way to do this?

Answer 1

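You could do it like this:

Get the file size (in bytes)
Pick a byte (a random number in [0..file.length()), using RandomAccessFile)
Go to that position in the file (file.seek(number))
Move to the position right after the next \n character (read forward until you have passed it)
Read the line (file.readLine())

For example...

This way you don't have to store anything.

A sample theoretical snippet might look like this (it contains some side effects):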

File f = new File("D:/abc.txt");
// try-with-resources closes the file even if an exception is thrown
try (RandomAccessFile file = new RandomAccessFile(f, "r")) {
    long file_size = file.length();
    long chosen_byte = (long)(Math.random() * file_size);

    file.seek(chosen_byte);

    // Skip forward past the end of the line we landed in
    while (chosen_byte < file_size)
    {
        int a_byte = file.read();   // returns -1 at end of file instead of throwing
        chosen_byte += 1;
        if (a_byte == -1 || a_byte == '\n') break;
        System.out.println("\"" + (char)a_byte + "\"");   // debug output (the side effect)
    }

    // If anything is left after that newline, the next line is our pick
    if (chosen_byte < file_size)
    {
        String s = file.readLine();   // strips "\n" and "\r\n"
        int chosen = Integer.parseInt(s.trim());
        System.out.println("Chosen id : \"" + chosen + "\"");
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

Edit: complete working version (in theory)

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;


public class Main {

    /**
     * Picks a random byte in the file and prints the ID on the line that contains it.
     * Works whether or not the file ends with a newline.
     * 
     * @param args
     * @throws Exception 
     */
    public static void main(String[] args) throws Exception {

        File f = new File("D:/abc.txt");

        // try-with-resources closes the file even if an exception is thrown
        try (RandomAccessFile file = new RandomAccessFile(f, "r")) {

            long file_size = file.length();

            // Let's start
            long chosen_byte = (long)(Math.random() * file_size);
            long cur_byte = chosen_byte;

            // Walk left until the previous byte is '\n' (or we hit the start of the file)
            while (cur_byte > 0)
            {
                file.seek(cur_byte - 1);
                if (file.read() == '\n') break;
                --cur_byte;
            }

            // Read the whole line (readLine() strips "\n" and "\r\n")
            file.seek(cur_byte);
            String s_LR = file.readLine();

            // Parse ID
            if (s_LR != null && !s_LR.trim().isEmpty())
            {
                int chosen_id = Integer.parseInt(s_LR.trim());
                System.out.println("Chosen id : " + chosen_id);
            }
            else
            {
                throw new Exception("Landed on an empty line. But this usually never happens...");
            }

        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

I hope this isn't a wrong implementation (I'm doing more C++ these days)...

Answer 2

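Instead of storing the user IDs in a hash, you could parse the file once and store only the line offsets in an int[] array. 30M offsets would need about 120 MB of RAM (int offsets assume the file is smaller than 2 GB; beyond that you would need a long[], at about 240 MB).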
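A minimal sketch of the offset-index idea, to make it concrete. The class and method names (OffsetIndexSketch, buildOffsets, randomLine) are made up for illustration, and the code assumes an ASCII file smaller than 2 GB with '\n' or "\r\n" line endings and at least one line:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.Random;

class OffsetIndexSketch {

    /** Scans the file once and records the byte offset at which each line starts. */
    static int[] buildOffsets(String path) throws IOException {
        int[] offsets = new int[1024];
        int count = 0;
        long pos = 0;
        boolean atLineStart = true;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    if (count == offsets.length) {
                        offsets = Arrays.copyOf(offsets, count * 2);   // grow the array
                    }
                    // Assumption: file < 2 GB; toIntExact throws if an offset overflows an int
                    offsets[count++] = Math.toIntExact(pos);
                    atLineStart = false;
                }
                if (b == '\n') {
                    atLineStart = true;   // the next byte (if any) starts a new line
                }
                pos++;
            }
        }
        return Arrays.copyOf(offsets, count);   // trim to the number of lines found
    }

    /** Picks a uniformly random line by seeking straight to its recorded offset. */
    static String randomLine(RandomAccessFile file, int[] offsets, Random r) throws IOException {
        file.seek(offsets[r.nextInt(offsets.length)]);
        return file.readLine();   // strips "\n" and "\r\n"
    }
}
```

The index is built once; after that every pick costs one seek and one short read, and each line is equally likely regardless of its length.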

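Alternatively, if you can change or preprocess the file in some way, you could convert it to a fixed-width format, either by padding the user IDs or by using a binary format.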
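With fixed-width records, the position of record k is simply k times the record length, so no index is needed. A rough sketch, assuming each record is a 10-digit zero-padded non-negative ID followed by '\n' and that the file holds at least one record; the width and the names FixedWidthSketch, toRecord and randomId are illustrative choices, not something given in the question:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Random;

class FixedWidthSketch {

    // Assumption: every record is WIDTH ASCII digits (zero-padded) followed by '\n'
    static final int WIDTH = 10;
    static final int RECORD_LENGTH = WIDTH + 1;

    /** Formats one ID as a fixed-width record, used when preprocessing the file. */
    static String toRecord(long id) {
        return String.format(Locale.ROOT, "%0" + WIDTH + "d\n", id);
    }

    /** Reads a uniformly random record; no scanning and no index are needed. */
    static long randomId(RandomAccessFile file, Random r) throws IOException {
        long records = file.length() / RECORD_LENGTH;
        long k = (long) (r.nextDouble() * records);   // record number in [0, records)
        byte[] buf = new byte[WIDTH];
        file.seek(k * RECORD_LENGTH);
        file.readFully(buf);                           // read exactly the digits, not the '\n'
        return Long.parseLong(new String(buf, StandardCharsets.US_ASCII));
    }
}
```

The preprocessing pass would write each ID through toRecord once; after that, randomId needs only the file length to pick uniformly.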

Answer 3

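The OP states: "I know the total number of user IDs in that big text file." Call this N.

Generate a random number k between 1 and N.
Read lines (BufferedReader) until you reach the k-th line.
Done.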
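The steps above could be sketched roughly as follows. The names LineScanSketch, readLine and randomLine are hypothetical, and the file is assumed to be plain ASCII with one ID per line:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

class LineScanSketch {

    /** Returns line k (1-based), or null if the file has fewer than k lines. */
    static String readLine(String path, long k) throws IOException {
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(path), StandardCharsets.US_ASCII)) {
            String line = null;
            for (long i = 0; i < k; i++) {
                line = reader.readLine();
                if (line == null) {
                    return null;   // ran past the end of the file
                }
            }
            return line;
        }
    }

    /** Picks a uniformly random line, given the known line count n (n >= 1). */
    static String randomLine(String path, int n, Random r) throws IOException {
        int k = 1 + r.nextInt(n);   // uniform in [1, n]
        return readLine(path, k);
    }
}
```

This uses almost no memory, but every pick reads the file from the start, about N/2 lines on average, so it suits occasional picks better than a tight loop.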
