Splitting a huge CSV with a custom filter?

时间:2014-07-04 13:16:09

标签: java multithreading parsing batch-file file-io

I have a huge (> 5GB) CSV file in the format: username,transaction

As output I want a separate CSV file for each user, containing only that user's transactions in the same format. I have a few ideas in mind, but I would like to hear other ideas for an efficient (fast and memory-friendly) implementation.
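For illustration only (made-up rows, not real data), an input such as

username,transaction
alice,t1
bob,t2
alice,t3

should produce alice.csv with alice's two rows and bob.csv with bob's row, each in the same username,transaction format.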

Here is what I have so far. The first test reads/processes/writes in a single thread, the second one uses multiple threads. The performance is not great, so I suspect I am doing something wrong. Please correct me.

public class BatchFileReader {


private ICsvBeanReader beanReader;
private double total;
private String[] header;
private CellProcessor[] processors;
private DataTransformer<HashMap<String, List<LoginDto>>> processor;
private boolean hasMoreRecords = true;

public BatchFileReader(String file, DataTransformer<HashMap<String, List<LoginDto>>> processor) {
    try {
        this.processor = processor;
        this.beanReader = new CsvBeanReader(new FileReader(file), CsvPreference.STANDARD_PREFERENCE);
        header = CSVUtils.getHeader(beanReader.getHeader(true));
        processors = CSVUtils.getProcessors();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public void read() {
    try {
        readFile();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (beanReader != null) {
            try {
                beanReader.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

    }
}

// Reads the file batch by batch and hands every non-empty batch to the transformer.
private void readFile() throws IOException {
    while (hasMoreRecords) {

        long start = System.currentTimeMillis();

        HashMap<String, List<LoginDto>> usersBatch = readBatch();

        long end = System.currentTimeMillis();
        System.out.println("Reading batch for " + ((end - start) / 1000f) + " seconds.");
        total +=((end - start)/ 1000f);
        if (processor != null && !usersBatch.isEmpty()) {
            processor.transform(usersBatch);
        }
    }
    System.out.println("total = " + total);
}

// Reads up to CONFIG.READ_BATCH_SIZE rows and groups them by username.
private HashMap<String, List<LoginDto>> readBatch() throws IOException {
    HashMap<String, List<LoginDto>> users = new HashMap<String, List<LoginDto>>();
    int readLoginCount = 0;
    while (readLoginCount < CONFIG.READ_BATCH_SIZE) {
        LoginDto login = beanReader.read(LoginDto.class, header, processors);
        if (login != null) {
            if (!users.containsKey(login.getUsername())) {
                List<LoginDto> logins = new LinkedList<LoginDto>();
                users.put(login.getUsername(), logins);
            }
            users.get(login.getUsername()).add(login);
            readLoginCount++;
        } else {
            hasMoreRecords = false;
            break;
        }
    }   
    return users;
}

}

public class BatchFileWriter<T> {

private final String file;

private final List<T> processedData;

public BatchFileWriter(final String file,  List<T> processedData) {
    this.file = file;
    this.processedData = processedData;
}

public void write() {
    try {
        writeFile(file, processedData);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

// Appends every record to the given file, one toString() per line.
private void writeFile(final String file, final List<T> processedData) throws IOException {
    System.out.println("START WRITE " + "  " + file);
    FileWriter writer = new FileWriter(file, true);

    long start = System.currentTimeMillis();

    for (T record : processedData) {
        writer.write(record.toString());
        writer.write("\n");
    }
    writer.flush();
    writer.close();

    long end = System.currentTimeMillis();
    System.out.println("Writing in file " + file + " complete for " + ((end - start) / 1000f) + " seconds.");

}

}

public class LoginsTest {

private static final ExecutorService executor = Executors.newSingleThreadExecutor();
private static final ExecutorService procExec = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() + 1);

@Test
public void testSingleThreadCSVtoCSVSplit() throws InterruptedException, ExecutionException {
    long start = System.currentTimeMillis();

    DataTransformer<HashMap<String, List<LoginDto>>> simpleSplitProcessor =  new DataTransformer<HashMap<String, List<LoginDto>>>() {
        @Override
        public void transform(HashMap<String, List<LoginDto>> data) {
            for (String field : data.keySet()) {
                new BatchFileWriter<LoginDto>(field + ".csv", data.get(field)).write();
            }
        }

    };

    BatchFileReader reader = new BatchFileReader("loadData.csv", simpleSplitProcessor);
    reader.read();
    long end = System.currentTimeMillis();
    System.out.println("TOTAL " + ((end - start)/ 1000f) + " seconds.");
}

@Test
public void testMultiThreadCSVtoCSVSplit() throws InterruptedException, ExecutionException {

    long start = System.currentTimeMillis();
    System.out.println(start);

    final DataTransformer<HashMap<String, List<LoginDto>>> simpleSplitProcessor =  new DataTransformer<HashMap<String, List<LoginDto>>>() {
        @Override
        public void transform(HashMap<String, List<LoginDto>> data) {
            System.out.println("transform");
            processAsync(data);
        }
    };
    final CountDownLatch readLatch = new CountDownLatch(1);
    executor.execute(new Runnable() {
    @Override
    public void run() {
        BatchFileReader reader = new BatchFileReader("loadData.csv", simpleSplitProcessor);
        reader.read();
        System.out.println("read latch count down");
        readLatch.countDown();
    }});
    System.out.println("read latch before await");
    readLatch.await();
    System.out.println("read latch after await");
    procExec.shutdown();
    executor.shutdown();
    long end = System.currentTimeMillis();
    System.out.println("TOTAL " + ((end - start)/ 1000f) + " seconds.");

}


private void processAsync(final HashMap<String, List<LoginDto>> data) {
    procExec.execute(new Runnable() {
        @Override
        public void run() {
            for (String field : data.keySet()) {
                writeASync(field, data.get(field));
            }
        }

    });     
}

private void writeASync(final String field, final List<LoginDto> data) {
    procExec.execute(new Runnable() {
        @Override
        public void run() {

            new BatchFileWriter<LoginDto>(field + ".csv", data).write();    
        }
    });
}

}

3 Answers:

Answer 0 (score: 1)

Wouldn't it be better to sort and then split the original file using unix commands?

Something like: cat txn.csv | sort > txn-sorted.csv

From there, get the list of unique usernames via grep, then grep the sorted file for each username.

Answer 1 (score: 1)

If you already know Camel, I would write a simple Camel route to:

- read a line from the file
- parse the line
- write it to the correct output file

It is a very simple route, but if you want it to be as fast as possible, it is easy to make it multithreaded.

For example, your route would look something like:

from("file:/myfile.csv")
.beanRef("lineParser")
.to("seda:internal-queue");

from("seda:internal-queue")
.concurrentConsumers(5)
.to("fileWriter");

If you don't already know Camel, it is not worth learning it just for this one task. However, you will probably need to make it multithreaded to get the maximum performance. You will have to experiment to find the best threading setup, because it depends on which parts of the operation are slowest.

Multithreading will use more memory, so you will need to balance memory efficiency against performance.

Answer 2 (score: 0)

I would open/append a new output file for each user. If you want to minimize memory usage at the cost of more I/O overhead, you can do something like the following, but you will probably want to use a real CSV parser like Super CSV (http://supercsv.sourceforge.net/index.html):

Scanner s = new Scanner(new File("/my/dir/users-and-transactions.txt"));
while (s.hasNextLine()) {
    String line = s.nextLine();
    String[] tokens = line.split(",");
    String user = tokens[0];
    String transaction = tokens[1];
    // Opening and closing the stream for every row keeps memory usage minimal,
    // at the cost of one open/close per input line.
    PrintStream out = new PrintStream(new FileOutputStream("/my/dir/" + user, true));
    out.println(transaction);
    out.close();
}
s.close();

If you have a reasonable amount of memory, you could create a Map of username to OutputStream. Each time you see a user string, you get the existing OutputStream for that username, or create a new one if it does not exist yet.
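
A minimal sketch of that Map-of-writers variant (my own code, not from the answer; the input path and the ".csv" suffix are assumptions) might look like this:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class SplitWithWriterCache {
    public static void main(String[] args) throws IOException {
        Map<String, PrintWriter> writers = new HashMap<String, PrintWriter>();
        Scanner s = new Scanner(new File("/my/dir/users-and-transactions.txt"));
        try {
            while (s.hasNextLine()) {
                String[] tokens = s.nextLine().split(",", 2);
                String user = tokens[0];
                PrintWriter out = writers.get(user);
                if (out == null) {
                    // Open the per-user file once, in append mode, and keep it open.
                    out = new PrintWriter(new BufferedWriter(
                            new FileWriter("/my/dir/" + user + ".csv", true)));
                    writers.put(user, out);
                }
                out.println(tokens[1]);
            }
        } finally {
            s.close();
            for (PrintWriter out : writers.values()) {
                out.close();
            }
        }
    }
}

The trade-off is one open file handle per distinct user, so with a very large number of users you could run into the OS file-descriptor limit.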