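I have a huge (> 5GB) CSV file in the format: username,transaction
I want to produce a separate CSV file for each user as output, containing only that user's transactions in the same format. I have a few ideas of my own, but I would like to hear other ideas for an efficient (fast and memory-efficient) implementation.
Here is what I have done so far. The first test reads/processes/writes in a single thread, the second uses multiple threads. The performance is not that good, so I think I am doing something wrong. Please correct me.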
public class BatchFileReader {
    private ICsvBeanReader beanReader;
    private double total;
    private String[] header;
    private CellProcessor[] processors;
    private DataTransformer<HashMap<String, List<LoginDto>>> processor;
    private boolean hasMoreRecords = true;

    public BatchFileReader(String file, DataTransformer<HashMap<String, List<LoginDto>>> processor) {
        try {
            this.processor = processor;
            this.beanReader = new CsvBeanReader(new FileReader(file), CsvPreference.STANDARD_PREFERENCE);
            header = CSVUtils.getHeader(beanReader.getHeader(true));
            processors = CSVUtils.getProcessors();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void read() {
        try {
            readFile();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (beanReader != null) {
                try {
                    beanReader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    private void readFile() throws IOException {
        while (hasMoreRecords) {
            long start = System.currentTimeMillis();
            HashMap<String, List<LoginDto>> usersBatch = readBatch();
            long end = System.currentTimeMillis();
            System.out.println("Reading batch for " + ((end - start) / 1000f) + " seconds.");
            total += ((end - start) / 1000f);
            if (processor != null && !usersBatch.isEmpty()) {
                processor.transform(usersBatch);
            }
        }
        System.out.println("total = " + total);
    }

    private HashMap<String, List<LoginDto>> readBatch() throws IOException {
        HashMap<String, List<LoginDto>> users = new HashMap<String, List<LoginDto>>();
        int readLoginCount = 0;
        while (readLoginCount < CONFIG.READ_BATCH_SIZE) {
            LoginDto login = beanReader.read(LoginDto.class, header, processors);
            if (login != null) {
                if (!users.containsKey(login.getUsername())) {
                    List<LoginDto> logins = new LinkedList<LoginDto>();
                    users.put(login.getUsername(), logins);
                }
                users.get(login.getUsername()).add(login);
                readLoginCount++;
            } else {
                hasMoreRecords = false;
                break;
            }
        }
        return users;
    }
}
public class BatchFileWriter<T> {
    private final String file;
    private final List<T> processedData;

    public BatchFileWriter(final String file, List<T> processedData) {
        this.file = file;
        this.processedData = processedData;
    }

    public void write() {
        try {
            writeFile(file, processedData);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
        }
    }

    private void writeFile(final String file, final List<T> processedData) throws IOException {
        System.out.println("START WRITE " + " " + file);
        FileWriter writer = new FileWriter(file, true);
        long start = System.currentTimeMillis();
        for (T record : processedData) {
            writer.write(record.toString());
            writer.write("\n");
        }
        writer.flush();
        writer.close();
        long end = System.currentTimeMillis();
        System.out.println("Writing in file " + file + " complete for " + ((end - start) / 1000f) + " seconds.");
    }
}
public class LoginsTest {
    private static final ExecutorService executor = Executors.newSingleThreadExecutor();
    private static final ExecutorService procExec = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() + 1);

    @Test
    public void testSingleThreadCSVtoCSVSplit() throws InterruptedException, ExecutionException {
        long start = System.currentTimeMillis();
        DataTransformer<HashMap<String, List<LoginDto>>> simpleSplitProcessor = new DataTransformer<HashMap<String, List<LoginDto>>>() {
            @Override
            public void transform(HashMap<String, List<LoginDto>> data) {
                for (String field : data.keySet()) {
                    new BatchFileWriter<LoginDto>(field + ".csv", data.get(field)).write();
                }
            }
        };
        BatchFileReader reader = new BatchFileReader("loadData.csv", simpleSplitProcessor);
        reader.read();
        long end = System.currentTimeMillis();
        System.out.println("TOTAL " + ((end - start) / 1000f) + " seconds.");
    }

    @Test
    public void testMultiThreadCSVtoCSVSplit() throws InterruptedException, ExecutionException {
        long start = System.currentTimeMillis();
        System.out.println(start);
        final DataTransformer<HashMap<String, List<LoginDto>>> simpleSplitProcessor = new DataTransformer<HashMap<String, List<LoginDto>>>() {
            @Override
            public void transform(HashMap<String, List<LoginDto>> data) {
                System.out.println("transform");
                processAsync(data);
            }
        };
        final CountDownLatch readLatch = new CountDownLatch(1);
        executor.execute(new Runnable() {
            @Override
            public void run() {
                BatchFileReader reader = new BatchFileReader("loadData.csv", simpleSplitProcessor);
                reader.read();
                System.out.println("read latch count down");
                readLatch.countDown();
            }
        });
        System.out.println("read latch before await");
        readLatch.await();
        System.out.println("read latch after await");
        procExec.shutdown();
        executor.shutdown();
        long end = System.currentTimeMillis();
        System.out.println("TOTAL " + ((end - start) / 1000f) + " seconds.");
    }

    private void processAsync(final HashMap<String, List<LoginDto>> data) {
        procExec.execute(new Runnable() {
            @Override
            public void run() {
                for (String field : data.keySet()) {
                    writeASync(field, data.get(field));
                }
            }
        });
    }

    private void writeASync(final String field, final List<LoginDto> data) {
        procExec.execute(new Runnable() {
            @Override
            public void run() {
                new BatchFileWriter<LoginDto>(field + ".csv", data).write();
            }
        });
    }
}
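Answer 0 (score: 1)
Wouldn't it be better to sort the original file with Unix commands and then split it?
Something like: cat txn.csv | sort > txn-sorted.csv
From there, get the list of unique usernames via grep, then grep the sorted file for each username.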
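Once the file is sorted, the per-username grep can also be replaced by a single pass over it. Below is a minimal Java sketch of that split step. It assumes each user's lines are adjacent after sorting (for example via sort -t, -k1,1 txn.csv), that there is no header row, and that usernames contain no commas or quotes and are safe to use as file names. The class name SortedCsvSplitter and the <username>.csv output naming are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SortedCsvSplitter {

    /** Splits a username-sorted CSV into one file per user; returns the number of lines written. */
    public static long split(Path sortedCsv, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        long written = 0;
        String currentUser = null;
        BufferedWriter out = null;
        try (BufferedReader in = Files.newBufferedReader(sortedCsv, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) {
                    continue; // assumption: blank or malformed lines are simply skipped
                }
                String user = line.substring(0, comma);
                if (!user.equals(currentUser)) {
                    // Username changed: close the previous file and open the next one.
                    if (out != null) {
                        out.close();
                    }
                    // APPEND keeps the output correct even if a user shows up again later.
                    out = Files.newBufferedWriter(outDir.resolve(user + ".csv"),
                            StandardCharsets.UTF_8,
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    currentUser = user;
                }
                out.write(line);
                out.newLine();
                written++;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java SortedCsvSplitter txn-sorted.csv out
        System.out.println(split(Paths.get(args[0]), Paths.get(args[1])) + " lines written");
    }
}
```

Only one output file is open at any time, so memory use stays flat regardless of how many users there are. The cost is the up-front sort.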
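Answer 1 (score: 1)
If you already know Camel, I would write a simple Camel route that reads lines from the file, parses each line, and writes it to the correct output file.
It is a very simple route, and if you want it to be as fast as possible, it is easy to make it multi-threaded.
For example, your route would look something like: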
from("file:/myfile.csv")
    .beanRef("lineParser")
    .to("seda:internal-queue");
// concurrentConsumers is an option on the seda endpoint URI
from("seda:internal-queue?concurrentConsumers=5")
    .to("fileWriter");
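If you don't know Camel, it isn't worth learning it for this one task. However, you will probably need to make the job multi-threaded to get maximum performance. You will have to experiment to find where threading helps most, since that depends on which parts of the operation are slowest.
Multi-threading will use more memory, so you will need to balance memory efficiency against performance.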
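For readers who don't use Camel, here is a rough plain-Java sketch of the same shape: one reader feeding queues that several writer threads consume. This is not the Camel route above, just one possible hand-rolled equivalent. The class name ParallelCsvSplitter, the queue capacity of 10,000 and the <username>.csv naming are all assumptions. It also assumes no header row and usernames that are safe file names. Lines are routed by a hash of the username, so a given user's file is only ever written by one thread.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCsvSplitter {
    // End-of-input marker, compared by identity so it cannot collide with a real line.
    private static final String EOF = new String("EOF");

    public static void split(Path input, Path outDir, int workers) throws Exception {
        Files.createDirectories(outDir);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<BlockingQueue<String>> queues = new ArrayList<>();
        List<Future<Object>> results = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000); // bounded to cap memory
            Callable<Object> task = () -> { drain(queue, outDir); return null; };
            queues.add(queue);
            results.add(pool.submit(task));
        }
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) continue; // assumption: skip blank or malformed lines
                // Same user always maps to the same queue, hence the same writer thread.
                int idx = Math.floorMod(line.substring(0, comma).hashCode(), workers);
                queues.get(idx).put(line);
            }
        } finally {
            for (BlockingQueue<String> queue : queues) queue.put(EOF);
            pool.shutdown();
        }
        for (Future<Object> result : results) result.get(); // rethrows any worker failure
    }

    private static void drain(BlockingQueue<String> queue, Path outDir)
            throws IOException, InterruptedException {
        Map<String, BufferedWriter> writers = new HashMap<>();
        IOException failure = null;
        try {
            while (true) {
                String line = queue.take();
                if (line == EOF) break;
                if (failure != null) continue; // keep draining so the reader never blocks
                try {
                    String user = line.substring(0, line.indexOf(','));
                    BufferedWriter w = writers.get(user);
                    if (w == null) {
                        w = Files.newBufferedWriter(outDir.resolve(user + ".csv"),
                                StandardCharsets.UTF_8,
                                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                        writers.put(user, w);
                    }
                    w.write(line);
                    w.newLine();
                } catch (IOException e) {
                    failure = e;
                }
            }
        } finally {
            for (BufferedWriter w : writers.values()) {
                try { w.close(); } catch (IOException e) { if (failure == null) failure = e; }
            }
        }
        if (failure != null) throw failure;
    }
}
```

Whether this beats a single thread depends on where the time goes. If the disk is the bottleneck, extra writer threads may not help, so it is worth measuring with different worker counts.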
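Answer 2 (score: 0)
I would open a new output file per user and append to it. If you want to minimize memory usage at the cost of more I/O overhead, you could do something like the following, though you will probably want to use a real CSV parser such as Super CSV (http://supercsv.sourceforge.net/index.html):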
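import java.io.File;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.Scanner;

Scanner s = new Scanner(new File("/my/dir/users-and-transactions.txt"));
while (s.hasNextLine()) {
    String line = s.nextLine();
    // Limit of 2 keeps any further commas inside the transaction field
    String[] tokens = line.split(",", 2);
    String user = tokens[0];
    String transaction = tokens[1];
    // Reopen the user's file in append mode for every line: low memory, high I/O
    PrintStream out = new PrintStream(new FileOutputStream("/my/dir/" + user, true));
    out.println(transaction);
    out.close();
}
s.close();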
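If you have a reasonable amount of memory, you can keep a Map from username to OutputStream. Each time you see a user string, fetch the existing OutputStream for that username, or create a new one if none exists yet.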
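A minimal sketch of that Map-based variant might look like the following. It uses a BufferedWriter per user rather than a raw OutputStream, and writes the whole input line so the output keeps the username,transaction format. The class name PerUserCsvSplitter and the <username>.csv naming are made up for illustration. It assumes there is no header row and that usernames are safe file names.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

public class PerUserCsvSplitter {

    /** Splits an unsorted CSV into one file per user; returns the number of distinct users. */
    public static int split(Path input, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        Map<String, BufferedWriter> writers = new HashMap<>();
        IOException closeFailure = null;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) {
                    continue; // assumption: blank or malformed lines are simply skipped
                }
                String user = line.substring(0, comma);
                BufferedWriter out = writers.get(user);
                if (out == null) {
                    // First time this user is seen: open the file once and keep it open.
                    out = Files.newBufferedWriter(outDir.resolve(user + ".csv"),
                            StandardCharsets.UTF_8,
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    writers.put(user, out);
                }
                out.write(line);
                out.newLine();
            }
        } finally {
            // Close every writer even if one of them fails; remember the first failure.
            for (BufferedWriter w : writers.values()) {
                try {
                    w.close();
                } catch (IOException e) {
                    if (closeFailure == null) {
                        closeFailure = e;
                    }
                }
            }
        }
        if (closeFailure != null) {
            throw closeFailure;
        }
        return writers.size();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java PerUserCsvSplitter users-and-transactions.txt out
        System.out.println(split(Paths.get(args[0]), Paths.get(args[1])) + " users");
    }
}
```

This keeps one open file per distinct user, so with a very large number of users it can hit the operating system's limit on open files. In that case you would need to close and reopen writers as you go, or sort the file first.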