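I wrote a servlet in Java that reads lines from a file stored in Google Cloud Storage. After reading each line, I pass it to the Prediction API. Once I get the prediction for that text, I append it to the original line and store the result in another file in Google Cloud Storage.
The source file is a CSV with more than 10,000 records. Because I parse each record individually, pass it to the Prediction API, and then write it back to Cloud Storage, the whole job takes a long time. App Engine requests are limited to 30 seconds, and task queues have a time limit as well. Can anyone suggest some options? Restarting the program is not an option, because I cannot resume the predictions from where I stopped.
Here is my code: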
    @SuppressWarnings("serial")
    public class PredictionWebAppServlet extends HttpServlet {

        private static final String APPLICATION_NAME = "span-test-app";
        static final String MODEL_ID = "span-senti";
        static final String STORAGE_DATA_LOCATION = "/bigdata/training_set/";
        private static HttpTransport httpTransport;
        private static final JsonFactory JSON_FACTORY = JacksonFactory
                .getDefaultInstance();
        public static final String INPUT_BUCKETNAME = "bigdata";
        public static final String INPUT_FILENAME = "abc.csv";
        public static final String OUTPUT_BUCKETNAME = "bigdata";
        public static final String OUTPUT_FILENAME = "def.csv";

        // Builds service-account credentials scoped to the Prediction API.
        private static Credential authorize() throws Exception {
            Credential cr = new GoogleCredential.Builder()
                    .setTransport(httpTransport)
                    .setJsonFactory(JSON_FACTORY)
                    .setServiceAccountId(
                            "878482284233-aacp8vd5297aqak7v5r0f507qr63mab4@developer.gserviceaccount.com")
                    .setServiceAccountScopes(
                            Collections.singleton(PredictionScopes.PREDICTION))
                    .setServiceAccountPrivateKeyFromP12File(
                            new File(
                                    "28617ba6faac0a51eb2208edba85d2e20e6081b4-privatekey.p12"))
                    .build();
            return cr;
        }

        public void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            try {
                httpTransport = GoogleNetHttpTransport.newTrustedTransport();
                Credential credential = authorize();
                Prediction prediction = new Prediction.Builder(httpTransport,
                        JSON_FACTORY, credential).setApplicationName(APPLICATION_NAME)
                        .build();

                GcsService gcsService = GcsServiceFactory.createGcsService();
                GcsFilename filename = new GcsFilename(INPUT_BUCKETNAME, INPUT_FILENAME);
                GcsFilename filename1 = new GcsFilename(OUTPUT_BUCKETNAME,
                        OUTPUT_FILENAME);
                GcsFileOptions options = new GcsFileOptions.Builder()
                        .mimeType("text/html").acl("public-read")
                        .addUserMetadata("myfield1", "my field value").build();

                GcsOutputChannel writeChannel = gcsService.createOrReplace(filename1, options);
                PrintWriter writer = new PrintWriter(Channels.newWriter(writeChannel,
                        "UTF8"));
                GcsInputChannel readChannel = gcsService.openReadChannel(filename, 0);
                BufferedReader reader = new BufferedReader(
                        Channels.newReader(readChannel, "UTF8"));

                String line;
                String cvsSplitBy = ",";
                String temp_record = "";
                Input input = new Input();
                InputInput inputInput = new InputInput();

                // One Prediction API call per CSV row; this loop is the slow part.
                while ((line = reader.readLine()) != null) {
                    String[] post = line.split(cvsSplitBy);
                    // The second column holds the text to classify.
                    inputInput.setCsvInstance(Collections
                            .<Object> singletonList(post[1]));
                    input.setInput(inputInput);
                    Output output = prediction.trainedmodels()
                            .predict("878482284233", MODEL_ID, input).execute();

                    // Reset the record for each row; without this, every output
                    // line would also contain all of the previous rows.
                    temp_record = "";
                    // Assumes every row has at least 10 columns.
                    for (int i = 0; i < 10; i++) {
                        temp_record = temp_record + post[i] + ",";
                    }
                    temp_record = temp_record + output.getOutputLabel();
                    writer.println(temp_record);
                }
                writer.flush();
                writer.close();
                reader.close();
                //resp.getWriter().println(temp_record);
            } catch (Exception e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    }
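Answer 0 (score: 5)
You hinted at the answer yourself.
If you think your job can finish within 10 minutes, you can do it with a task queue alone.
If not, you will need a combination of a task queue and a backend: push the work to a backend instance. Take a look at Push queues and backends.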
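On the concern about not being able to resume: one common way to make such a job restartable is to process the file in bounded batches and carry the next row offset from one task to the next. The following is only a minimal sketch of that idea in plain Java, not App Engine code. `ResumableBatch`, `processBatch`, and `BatchResult` are hypothetical names, the `predictor` function stands in for the Prediction API call, and enqueueing the follow-up task is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: handles one bounded batch of CSV rows per invocation.
final class ResumableBatch {

    /** Output rows of one batch plus the offset at which the next batch starts. */
    static final class BatchResult {
        final List<String> rows;
        final int nextOffset; // -1 once the input is exhausted

        BatchResult(List<String> rows, int nextOffset) {
            this.rows = rows;
            this.nextOffset = nextOffset;
        }
    }

    /**
     * Processes at most batchSize rows starting at offset and appends the
     * predicted label to each row. The predictor is an assumption that
     * replaces the real Prediction API call.
     */
    static BatchResult processBatch(List<String> lines, int offset, int batchSize,
                                    Function<String, String> predictor) {
        List<String> out = new ArrayList<>();
        int end = Math.min(offset + batchSize, lines.size());
        for (int i = offset; i < end; i++) {
            String line = lines.get(i);
            String[] cols = line.split(",", -1);
            // As in the question, the second column is the text to classify.
            String text = cols.length > 1 ? cols[1] : "";
            out.add(line + "," + predictor.apply(text));
        }
        int next = end < lines.size() ? end : -1;
        return new BatchResult(out, next);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,good", "2,bad", "3,good");
        Function<String, String> fakePredictor =
                t -> t.equals("good") ? "positive" : "negative";

        // In a real deployment each iteration would be a separate task that
        // receives the offset as a parameter and enqueues its successor.
        int offset = 0;
        while (offset != -1) {
            BatchResult result = processBatch(lines, offset, 2, fakePredictor);
            for (String row : result.rows) {
                System.out.println(row);
            }
            offset = result.nextOffset;
        }
    }
}
```

Because each batch depends only on the offset it is given, a task that fails or times out can be retried from that same offset without redoing earlier rows, provided the output for completed batches has already been persisted.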
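Update - use modules instead of backends
Backends have been deprecated in favor of modules. With modules, the approach is to run the job on an instance that uses manual scaling:
Manually scaled instances have no limit on how long they can run. You can run "forever" inside the "/_ah/start" request if the instance uses manual scaling. Hey, you can even start threads if you want, but that should not be necessary for this job. Just run until it is finished.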
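For reference, manual scaling is declared in the module's `appengine-web.xml`. The fragment below is a minimal sketch: the application ID `your-app-id`, the module name `worker`, the version, and the instance class are placeholder assumptions, not values taken from the question.

```xml
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <application>your-app-id</application>
  <!-- Placeholder module name for the long-running job -->
  <module>worker</module>
  <version>1</version>
  <threadsafe>true</threadsafe>
  <instance-class>B2</instance-class>
  <!-- Manual scaling: a fixed number of resident instances -->
  <manual-scaling>
    <instances>1</instances>
  </manual-scaling>
</appengine-web-app>
```

The servlet that does the work would then be mapped to `/_ah/start` in that module's `web.xml`.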
Answer 1 (score: 0)
This kind of thing is exactly what the MapReduce framework is for.