I have a Java ETL that queries Elasticsearch, packs the results into comma-separated rows, appends the rows to a StringBuilder, and repeatedly gzip-compresses the StringBuilder as rows accumulate. Once the compressed size reaches roughly 9.2 MB, it submits that compressed byte stream to the Salesforce Wave External Data API as a data part. This repeats until all parts have been submitted and the transaction is completed, at which point the data becomes visualizable in SFDC Wave.
The challenge I've run into: since refactoring this code into a multithreaded approach to make it faster, a problem has appeared (before multithreading it worked fine, just slowly). Specifically, the Wave Data Manager occasionally (inconsistently) hits an unexpected EOF in one of the data parts this code submits.
I'm showing the key elements of my code in the hope that someone with more concurrency experience can spot where my approach lets a stray EOF slip into the stream...
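For context, the parent InsightsExternalData record whose Id I pass to the threads as parentID is created once, up front, before any threads start. That code isn't part of the problem area, but it's essentially the standard header-row create (the alias and field values below are placeholders, not my real ones):

SObject header = new SObject();
header.setType("InsightsExternalData");
header.setField("Format", "Csv");
header.setField("EdgemartAlias", "elk_test_results"); // placeholder alias
header.setField("Operation", "Overwrite");
header.setField("Action", "None"); // parts get uploaded before processing is triggered
SaveResult[] hresults = partnerConnection.create(new SObject[] { header });
String parentID = hresults[0].getId(); // handed to each Elk2WaveEtl thread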
The relevant part of the thread executor is:
long minMaxDelta = maxResultId - minResultId;
long theIncrementSize = minMaxDelta / ((long) iterations);
long threadIncrement = (minMaxDelta / threadSize) + 1;
long theCurrentMax = 0;
long theCurrentMin = minResultId;
for (int ti = 0; ti < threadSize; ti++) {
    theCurrentMax = theCurrentMin + threadIncrement;
    Elk2WaveEtl etlThread = new Elk2WaveEtl(theCurrentMin, theCurrentMax,
            incrementSize, targetDate, elkPassword, parentID,
            partnerConnection);
    new Thread(theGroup, etlThread, "elk2wave" + ti).start();
    logger.info("started thread for : " + theCurrentMin + " <= " + theCurrentMax);
    theCurrentMin = theCurrentMax;
}
So the number of threads launched is an input parameter (threadSize), and each thread gets a range of Elasticsearch ids to query for the given date.
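To make the slicing concrete (numbers here are just illustrative, not my real ids): with minResultId = 0, maxResultId = 1000000 and threadSize = 4, threadIncrement works out to 250001, so the four threads are started with the ranges 0–250001, 250001–500002, 500002–750003 and 750003–1000004.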
As data is retrieved from Elasticsearch inside each thread, it is packed into pseudo-CSV rows with this call:
StringBuilder theCsvFile = new StringBuilder();
theCsvFile.append(CSVUtils.makeStringLine(Arrays.asList(testId,
        autobuildName, changelistOwner, scrumteam, testCategory,
        bugNumber, depotPath, typeName, lastRunStatus, testOwner, devOwner,
        className, runningTime, isBenchmark, isFailure, changelist,
        autobuildId, runId, changelistEmail, startDate, status, testName,
        failDate, testIdentifier, isFailure)));
The makeStringLine function is defined as follows:
public static String makeStringLine(List<String> values, char separators, char customQuote) throws IOException {
    boolean first = true;
    // default customQuote is empty
    if (separators == ' ') {
        separators = DEFAULT_SEPARATOR;
    }
    StringBuilder sb = new StringBuilder();
    for (String value : values) {
        if (!first) {
            sb.append(separators);
        }
        if (customQuote == ' ') {
            sb.append(followCVSformat(value));
        } else {
            sb.append(customQuote).append(followCVSformat(value)).append(customQuote);
        }
        first = false;
    }
    sb.append("\n");
    logger.debug(sb.toString());
    return sb.toString();
}
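The call shown earlier passes only the list of values; there is a single-argument convenience overload (not shown here) that supplies the defaults, roughly along these lines:

public static String makeStringLine(List<String> values) throws IOException {
    // ' ' tells the three-argument version to use DEFAULT_SEPARATOR and no quoting
    return makeStringLine(values, ' ', ' ');
}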
Now here is where it becomes concurrency-relevant. As the compressed byte stream of data gets submitted to the Wave External Data API, I have a synchronized block so the various threads don't step on each other at that point.
private static volatile AtomicInteger p = new AtomicInteger(0);

public int increment() {
    return p.incrementAndGet();
}
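pvalue(), used in the logging further down, is just the read side of that same counter, roughly:

public int pvalue() {
    // report the current part number without modifying it
    return p.get();
}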
......
// write the test info to Wave when we have about 9MB of data
if (compressedLength >= 1024 * 1000 * 9) { // 9 MB
    byte[] theData = compress(theCsvFile);
    theCsvFile = new StringBuilder();
    compressedLength = 0;
    if (theData != null && theData.length > 0) {
        synchronized (p) {
            SObject isobj = new SObject();
            isobj.setType("InsightsExternalDataPart");
            isobj.setField("DataFile", theData);
            isobj.setField("InsightsExternalDataId", parentID);
            isobj.setField("PartNumber", increment()); // Part numbers should start at 1
            logger.debug(" theRowSize " + theData.length);
            SaveResult[] iresults = partnerConnection.create(new SObject[] { isobj });
            for (SaveResult sv : iresults) {
                if (sv.isSuccess()) {
                    String rowId = sv.getId();
                    logger.info("saved rowId " + rowId + " for part " + pvalue());
                } else {
                    com.sforce.soap.partner.Error[] es = sv.getErrors();
                    for (int w = 0; w < es.length; w++) {
                        logger.error(es[w].getMessage());
                    }
                }
            }
        }
    }
}
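Once every thread has flushed its last part, the main thread closes out the transaction by updating the parent record with Action = "Process" so Wave starts digesting the parts. That step isn't part of the code in question, but for completeness it is essentially:

SObject done = new SObject();
done.setType("InsightsExternalData");
done.setId(parentID);
done.setField("Action", "Process"); // tells Wave all parts are uploaded
partnerConnection.update(new SObject[] { done });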
The compress function is defined as:
public static byte[] compress(StringBuilder data) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream(data.toString().length());
    GZIPOutputStream gzip = new GZIPOutputStream(bos);
    gzip.write(data.toString().getBytes(StandardCharsets.UTF_8));
    gzip.close();
    byte[] compressed = bos.toByteArray();
    bos.close();
    return compressed;
}
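For what it's worth, compressedLength in the upload block is not the StringBuilder's length; as described at the top, I keep it current by periodically re-compressing the accumulating buffer and recording the size of the result, along these lines (simplified sketch of the idea, not the exact code):

// after appending a batch of rows to theCsvFile
compressedLength = compress(theCsvFile).length;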