Question

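I have a huge (> 5GB) CSV file in the format: username,transaction

As output I want a separate CSV file for each user, containing only that user's transactions in the same format. I have a few ideas in mind, but I would like to hear other ideas for an efficient (fast and memory-efficient) implementation.

Here is what I have done so far. The first test reads/processes/writes in a single thread; the second uses multiple threads. The performance is not that good, so I think I am doing something wrong. Please correct me.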

public class BatchFileReader {


private ICsvBeanReader beanReader;
private double total;
private String[] header;
private CellProcessor[] processors;
private DataTransformer<HashMap<String, List<LoginDto>>> processor;
private boolean hasMoreRecords = true;

public BatchFileReader(String file, DataTransformer<HashMap<String, List<LoginDto>>> processor) {
    try {
        this.processor = processor;
        this.beanReader = new CsvBeanReader(new FileReader(file), CsvPreference.STANDARD_PREFERENCE);
        header = CSVUtils.getHeader(beanReader.getHeader(true));
        processors = CSVUtils.getProcessors();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public void read() {
    try {
        readFile();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (beanReader != null) {
            try {
                beanReader.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

    }
}

private void readFile() throws IOException {
    while (hasMoreRecords) {

        long start = System.currentTimeMillis();

        HashMap<String, List<LoginDto>> usersBatch = readBatch();

        long end = System.currentTimeMillis();
        System.out.println("Reading batch for " + ((end - start) / 1000f) + " seconds.");
        total +=((end - start)/ 1000f);
        if (processor != null && !usersBatch.isEmpty()) {
            processor.transform(usersBatch);
        }
    }
    System.out.println("total = " + total);
}

private HashMap<String, List<LoginDto>> readBatch() throws IOException {
    HashMap<String, List<LoginDto>> users = new HashMap<String, List<LoginDto>>();
    int readLoginCount = 0;
    while (readLoginCount < CONFIG.READ_BATCH_SIZE) {
        LoginDto login = beanReader.read(LoginDto.class, header, processors);
        if (login != null) {
            if (!users.containsKey(login.getUsername())) {
                List<LoginDto> logins = new LinkedList<LoginDto>();
                users.put(login.getUsername(), logins);
            }
            users.get(login.getUsername()).add(login);
            readLoginCount++;
        } else {
            hasMoreRecords = false;
            break;
        }
    }   
    return users;
}

}

public class BatchFileWriter<T> {

private final String file;

private final List<T> processedData;

public BatchFileWriter(final String file,  List<T> processedData) {
    this.file = file;
    this.processedData = processedData;
}

public void write() {
    try {
        writeFile(file, processedData);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
    }
}

private void writeFile(final String file, final List<T> processedData) throws IOException {
    System.out.println("START WRITE " + "  " + file);
    FileWriter writer = new FileWriter(file, true);

    long start = System.currentTimeMillis();

    for (T record : processedData) {
        writer.write(record.toString());
        writer.write("\n");
    }
    writer.flush();
    writer.close();

    long end = System.currentTimeMillis();
    System.out.println("Writing in file " + file + " complete for " + ((end - start) / 1000f) + " seconds.");

}

}

public class LoginsTest {

private static final ExecutorService executor = Executors.newSingleThreadExecutor();
private static final ExecutorService procExec = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() + 1);

@Test
public void testSingleThreadCSVtoCSVSplit() throws InterruptedException, ExecutionException {
    long start = System.currentTimeMillis();

    DataTransformer<HashMap<String, List<LoginDto>>> simpleSplitProcessor =  new DataTransformer<HashMap<String, List<LoginDto>>>() {
        @Override
        public void transform(HashMap<String, List<LoginDto>> data) {
            for (String field : data.keySet()) {
                new BatchFileWriter<LoginDto>(field + ".csv", data.get(field)).write();
            }
        }

    };

    BatchFileReader reader = new BatchFileReader("loadData.csv", simpleSplitProcessor);
    reader.read();
    long end = System.currentTimeMillis();
    System.out.println("TOTAL " + ((end - start)/ 1000f) + " seconds.");
}

@Test
public void testMultiThreadCSVtoCSVSplit() throws InterruptedException, ExecutionException {

    long start = System.currentTimeMillis();
    System.out.println(start);

    final DataTransformer<HashMap<String, List<LoginDto>>> simpleSplitProcessor =  new DataTransformer<HashMap<String, List<LoginDto>>>() {
        @Override
        public void transform(HashMap<String, List<LoginDto>> data) {
            System.out.println("transform");
            processAsync(data);
        }
    };
    final CountDownLatch readLatch = new CountDownLatch(1);
    executor.execute(new Runnable() {
    @Override
    public void run() {
        BatchFileReader reader = new BatchFileReader("loadData.csv", simpleSplitProcessor);
        reader.read();
        System.out.println("read latch count down");
        readLatch.countDown();
    }});
    System.out.println("read latch before await");
    readLatch.await();
    System.out.println("read latch after await");
    procExec.shutdown();
    executor.shutdown();
    long end = System.currentTimeMillis();
    System.out.println("TOTAL " + ((end - start)/ 1000f) + " seconds.");

}


private void processAsync(final HashMap<String, List<LoginDto>> data) {
    procExec.execute(new Runnable() {
        @Override
        public void run() {
            for (String field : data.keySet()) {
                writeASync(field, data.get(field));
            }
        }

    });     
}

private void writeASync(final String field, final List<LoginDto> data) {
    procExec.execute(new Runnable() {
        @Override
        public void run() {

            new BatchFileWriter<LoginDto>(field + ".csv", data).write();    
        }
    });
}

}

Answer 1

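Wouldn't it be better to sort the original file with Unix commands and then split it?

Something like: cat txn.csv | sort > txn-sorted.csv

From there, get a list of the unique usernames via grep, then grep the sorted file for each username.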
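Once the file is sorted, each user's rows are contiguous, so the per-username grep can be swapped for a single streaming pass that keeps only one output file open at a time. The following is a minimal sketch of that variant, not the approach the answer literally describes. It assumes the header row was removed before sorting, that the username is everything before the first comma (no quoted fields), and that usernames are safe to use as file names. The class name `SortedCsvSplitter` is made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SortedCsvSplitter {

    /**
     * Splits a CSV that is already sorted by its first column (username)
     * into one file per user. Returns the number of user files written.
     */
    public static int split(Path sortedCsv, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        int files = 0;
        String currentUser = null;
        BufferedWriter out = null;
        try (BufferedReader in = Files.newBufferedReader(sortedCsv, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                // Naive parsing (assumption): username is the text before the first comma.
                int comma = line.indexOf(',');
                String user = comma < 0 ? line : line.substring(0, comma);
                if (!user.equals(currentUser)) {
                    // A new run of rows starts: close the previous user's file.
                    if (out != null) {
                        out.close();
                    }
                    out = Files.newBufferedWriter(outDir.resolve(user + ".csv"), StandardCharsets.UTF_8);
                    currentUser = user;
                    files++;
                }
                out.write(line);
                out.newLine();
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return files;
    }
}
```

Memory use stays constant regardless of file size, because only the current line and one writer are held at any time; the cost is the external sort beforehand.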

Answer 2

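If you already know Camel, I would write a simple Camel route that reads lines from the file, parses each line, and writes to the correct output file.

It is a very simple route, and if you want it to be as fast as possible, it is trivially easy to make it multi-threaded.

For example, your route would look something like: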

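// The file component consumes from a directory; fileName selects the file.
// "/my/dir" is a placeholder path.
from("file:/my/dir?fileName=myfile.csv")
    // Stream the file line by line instead of loading it whole.
    .split(body().tokenize("\n")).streaming()
    .to("bean:lineParser")
    .to("seda:internal-queue");

// concurrentConsumers is an option on the seda endpoint URI.
from("seda:internal-queue?concurrentConsumers=5")
    .to("bean:fileWriter");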

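If you don't know Camel, it is not worth learning it for this one task. However, you will probably need to make it multi-threaded to get maximum performance. You will have to experiment to find the best number of threads, since that depends on which parts of the operation are slowest.

Multi-threading will use more memory, so you will need to balance memory efficiency against performance.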
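For readers not using Camel, the same trade-off can be sketched in plain Java: one reader thread feeds a fixed number of writer threads through bounded queues, so memory is capped at roughly `workers * queueCapacity` lines, and both numbers are the knobs to experiment with. This is only an illustrative sketch, not the answer's method. It assumes naive comma splitting, no header row, usernames that are valid file names, and few enough distinct users that each worker can keep their files open. `PartitionedSplitter` and its parameters are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionedSplitter {

    // End-of-input marker, compared by identity so no real line can match it.
    private static final String EOF = new String("EOF");

    public static void split(Path csv, Path outDir, int workers, int queueCapacity)
            throws IOException, InterruptedException, ExecutionException {
        Files.createDirectories(outDir);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<BlockingQueue<String>> queues = new ArrayList<>();
        List<Future<?>> results = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
            queues.add(queue);
            Callable<Void> task = () -> { drain(queue, outDir); return null; };
            results.add(pool.submit(task));
        }
        try (BufferedReader in = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) continue;
                // Same user -> same queue, so each output file has exactly one writer thread.
                int target = Math.floorMod(userOf(line).hashCode(), workers);
                queues.get(target).put(line); // blocks while that queue is full
            }
        } finally {
            for (BlockingQueue<String> queue : queues) queue.put(EOF);
            pool.shutdown();
        }
        for (Future<?> result : results) result.get(); // rethrows worker failures
    }

    // Naive parsing (assumption): username is the text before the first comma.
    private static String userOf(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }

    private static void drain(BlockingQueue<String> queue, Path outDir) throws Exception {
        Map<String, BufferedWriter> writers = new HashMap<>();
        Exception failure = null;
        for (String line = queue.take(); line != EOF; line = queue.take()) {
            if (failure != null) continue; // keep consuming so the reader never blocks forever
            try {
                String user = userOf(line);
                BufferedWriter out = writers.get(user);
                if (out == null) {
                    out = Files.newBufferedWriter(outDir.resolve(user + ".csv"),
                            StandardCharsets.UTF_8, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    writers.put(user, out);
                }
                out.write(line);
                out.newLine();
            } catch (IOException | RuntimeException e) {
                failure = e;
            }
        }
        for (BufferedWriter out : writers.values()) {
            try { out.close(); } catch (IOException e) { if (failure == null) failure = e; }
        }
        if (failure != null) throw failure;
    }
}
```

Routing by username hash means no two threads ever append to the same file, so no locking is needed around the writers, and the row order within each user's file matches the input. A call such as `PartitionedSplitter.split(Paths.get("loadData.csv"), Paths.get("out"), 4, 10_000)` would be a starting point for tuning.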

Answer 3

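I would open/append to a new output file for each user. If you want to minimize memory usage and are willing to incur more I/O overhead, you could do something like the following, though you will probably want to use a real CSV parser such as Super CSV (http://supercsv.sourceforge.net/index.html):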

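// Appends each transaction to a file named after its user.
// try-with-resources closes the scanner and each stream even if an exception is thrown.
try (Scanner s = new Scanner(new File("/my/dir/users-and-transactions.txt"))) {
    while (s.hasNextLine()) {
        String line = s.nextLine();
        // Limit the split to 2 parts so commas inside the transaction are preserved.
        String[] tokens = line.split(",", 2);
        String user = tokens[0];
        String transaction = tokens[1];
        // Open in append mode and close after every line: minimal memory, maximal I/O.
        try (PrintStream out = new PrintStream(new FileOutputStream("/my/dir/" + user, true))) {
            out.println(transaction);
        }
    }
}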

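If you have a reasonable amount of memory, you could create a Map from username to OutputStream. Each time you see a user string, you can get the existing OutputStream for that username, or create a new one if none exists.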
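A minimal sketch of that map-based variant follows. It uses a `BufferedWriter` per user rather than a raw `OutputStream`, and writes the whole input line so the output keeps the username,transaction format. It assumes naive comma splitting, no header row, usernames that are valid file names, and a number of distinct users below the operating system's open-file limit. The class name `WriterPerUserSplitter` is hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

public class WriterPerUserSplitter {

    /** Splits the CSV in one pass; returns the number of distinct users seen. */
    public static int split(Path csv, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        // One open, buffered writer per username, created on first sight.
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                // Naive parsing (assumption): username is the text before the first comma.
                int comma = line.indexOf(',');
                String user = comma < 0 ? line : line.substring(0, comma);
                BufferedWriter out = writers.get(user);
                if (out == null) {
                    out = Files.newBufferedWriter(outDir.resolve(user + ".csv"),
                            StandardCharsets.UTF_8, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    writers.put(user, out);
                }
                out.write(line);
                out.newLine();
            }
        } finally {
            // Close every writer, remembering the first failure so buffered data loss is not silent.
            IOException failure = null;
            for (BufferedWriter w : writers.values()) {
                try {
                    w.close();
                } catch (IOException e) {
                    if (failure == null) {
                        failure = e;
                    }
                }
            }
            if (failure != null) {
                throw failure;
            }
        }
        return writers.size();
    }
}
```

Compared with reopening the file for every line, this trades one file handle and one buffer per user for far fewer system calls. If the number of users exceeds the file-handle limit, the map would need an eviction policy that closes the least recently used writers.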

Split a huge CSV by a custom filter?
