I have been scratching my head over this for a while now. I have a huge CSV file with hundreds of billions of records.
The task at hand is simple: build JSON from this CSV file and post it to a server, and I want to get it done as fast as possible. My code for reading the CSV so far is:
protected void readIdentityCsvDynamicFetch() {
    String csvFile = pathOfIdentities;
    CSVReader reader = null;
    PayloadEngine payloadEngine = new PayloadEngine();
    long counter = 0;
    int size;
    List<IdentityJmPojo> identityJmList = new ArrayList<IdentityJmPojo>();
    try {
        ExecutorService uploaderPoolService = Executors.newFixedThreadPool(3);
        long lineCount = lineCount(pathOfIdentities);
        logger.info("Line Count: " + lineCount);
        reader = new CSVReader(new BufferedReader(new FileReader(csvFile)), ',', '\'', OFFSET);
        String[] line;
        long startTime = System.currentTimeMillis();
        while ((line = reader.readNext()) != null) {
            IdentityJmPojo identityJmPojo = new IdentityJmPojo();
            identityJmPojo.setIdentity(line[0]);
            // fall back to the default jsonValue when the second column is missing
            identityJmPojo.setJM(line.length > 1 ? line[1] : jsonValue);
            identityJmList.add(identityJmPojo);
            size = identityJmList.size();
            // every STEP records, hand the batch off to the payload engine and start a fresh list
            if (size == STEP) {
                counter = counter + STEP;
                payloadEngine.prepareJson(identityJmList, uploaderPoolService, jsonKey);
                identityJmList = new ArrayList<IdentityJmPojo>();
                long stopTime = System.currentTimeMillis();
                long elapsedTime = stopTime - startTime;
                logger.info("=================== Time taken to read " + STEP + " records from CSV: " + elapsedTime
                        + " and total records read: " + counter + " ===================");
            }
        }
        // flush the remainder left over after the last full batch
        if (identityJmList.size() > 0) {
            logger.info("=================== Executing Last Loop - Payload Size: " + identityJmList.size() + " =================");
            payloadEngine.prepareJson(identityJmList, uploaderPoolService, jsonKey);
        }
        uploaderPoolService.shutdown();
    } catch (Throwable e) {
        e.printStackTrace();
        logger.error("CsvReader || readIdentityCsvDynamicFetch method ", e);
    } finally {
        try {
            if (reader != null) {
                reader.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
            logger.error("CsvReader || readIdentityCsvDynamicFetch method ", e);
        }
    }
}
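(For completeness, lineCount is just a small helper that streams through the file once and counts its lines; a simplified sketch of it, using plain java.io, looks like this:)

private long lineCount(String path) throws IOException {
    // Simplified sketch: read the file once and count its lines.
    try (BufferedReader br = new BufferedReader(new FileReader(path))) {
        long count = 0;
        while (br.readLine() != null) {
            count++;
        }
        return count;
    }
}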
Now, I use a ThreadPool executor service; in its run() method I have an Apache HttpClient set up to post the JSON to the server. (I am using a pooling connection manager and a keep-alive strategy, so the connection is opened and closed only once.)
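For reference, the SendPushNotification task looks roughly like the sketch below (simplified and untested; it assumes Apache HttpClient 4.x, and the endpoint URL and the exact connection-manager/keep-alive setup are placeholders for my real ones):

class SendPushNotification implements Runnable {
    // One shared client backed by a pooling connection manager, created once
    // and reused by every task (sketch; real keep-alive strategy omitted).
    private static final CloseableHttpClient httpClient = HttpClients.custom()
            .setConnectionManager(new PoolingHttpClientConnectionManager())
            .build();

    private final ObjectNode payload;

    SendPushNotification(ObjectNode payload) {
        this.payload = payload;
    }

    @Override
    public void run() {
        HttpPost post = new HttpPost("https://example.com/upload"); // placeholder URL
        post.setHeader("Content-Type", "application/json");
        post.setEntity(new StringEntity(payload.toString(), StandardCharsets.UTF_8));
        try (CloseableHttpResponse response = httpClient.execute(post)) {
            // consume the body so the connection goes back to the pool
            EntityUtils.consume(response.getEntity());
        } catch (IOException e) {
            logger.error("SendPushNotification || run method ", e);
        }
    }
}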
我创造&amp;发布我的JSON如下:
public void prepareJson(List<IdentityJmPojo> identities, ExecutorService notificationService, String key) {
    try {
        notificationService.submit(new SendPushNotification(prepareLowLevelJson(identities, key)));
        // prepareLowLevelJson(identities, key);
    } catch (Exception e) {
        e.printStackTrace();
        logger.error("PayloadEngine || prepareJson method ", e);
    }
}
private ObjectNode prepareLowLevelJson(List<IdentityJmPojo> identities, String key) {
    long startTime = System.currentTimeMillis();
    ObjectNode mainJacksonObject = JsonNodeFactory.instance.objectNode();
    ArrayNode dJacksonArray = JsonNodeFactory.instance.arrayNode();
    for (IdentityJmPojo identityJmPojo : identities) {
        ObjectNode dSingleObject = JsonNodeFactory.instance.objectNode();
        ObjectNode dProfileInnerObject = JsonNodeFactory.instance.objectNode();
        dSingleObject.put("identity", identityJmPojo.getIdentity());
        dSingleObject.put("ts", ts);
        dSingleObject.put("type", "profile");
        dProfileInnerObject.put(key, identityJmPojo.getJM());
        dSingleObject.set("profileData", dProfileInnerObject);
        dJacksonArray.add(dSingleObject);
    }
    mainJacksonObject.set("d", dJacksonArray);
    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    logger.info("=================== Time to create JSON: " + elapsedTime + " ===================");
    return mainJacksonObject;
}
Now here is the weird part: when I comment out the notification service call,
// notificationService.submit(new SendPushNotification(prepareLowLevelJson(identities, key)));
everything runs fine and I can read the CSV and prepare the JSON in under 29000 ms.
But when the actual task runs, it fails with an out-of-memory error, so I think there is a design flaw here. How can I process this much data quickly? Any hints would be greatly appreciated.
I suspect that creating the JSON objects and arrays inside the for loop also takes a lot of memory, but I can't seem to find a workable alternative.
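The closest I have come to an alternative is Jackson's streaming JsonGenerator, which writes the JSON straight to an OutputStream instead of building the whole node tree in memory. This is only a rough, untested sketch of the idea (writeLowLevelJsonStreaming is a hypothetical name, and it assumes ts is a long):

private void writeLowLevelJsonStreaming(List<IdentityJmPojo> identities, String key, OutputStream out) throws IOException {
    JsonFactory factory = new JsonFactory();
    // Write each record straight to the stream; no ObjectNode/ArrayNode tree is kept in memory.
    try (JsonGenerator gen = factory.createGenerator(out)) {
        gen.writeStartObject();
        gen.writeArrayFieldStart("d");
        for (IdentityJmPojo identityJmPojo : identities) {
            gen.writeStartObject();
            gen.writeStringField("identity", identityJmPojo.getIdentity());
            gen.writeNumberField("ts", ts); // assumes ts is a long
            gen.writeStringField("type", "profile");
            gen.writeObjectFieldStart("profileData");
            gen.writeStringField(key, identityJmPojo.getJM());
            gen.writeEndObject(); // profileData
            gen.writeEndObject(); // record
        }
        gen.writeEndArray();
        gen.writeEndObject();
    }
}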
Here is the stack trace:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.LinkedHashMap.createEntry(LinkedHashMap.java:442)
at java.util.HashMap.addEntry(HashMap.java:884)
at java.util.LinkedHashMap.addEntry(LinkedHashMap.java:427)
at java.util.HashMap.put(HashMap.java:505)
at com.fasterxml.jackson.databind.node.ObjectNode._put(ObjectNode.java:861)
at com.fasterxml.jackson.databind.node.ObjectNode.put(ObjectNode.java:769)
at uploader.PayloadEngine.prepareLowLevelJson(PayloadEngine.java:50)
at uploader.PayloadEngine.prepareJson(PayloadEngine.java:24)
at uploader.CsvReader.readIdentityCsvDynamicFetch(CsvReader.java:97)
at uploader.Main.main(Main.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)