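I load and read some data from HDFS in Avro format, perform some operations on it, and then want to save the result to a file. This works if the files are small or the query is simple, but it does not work for complex cases. Here is my code: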
public final static String TAG = "Q5";

public static void main(String[] args) throws Exception {
    // Save logs to files
    PrintStream orgOut = System.out;
    PrintStream orgError = System.err;
    PrintStream fileOut = new PrintStream("/results/5-out.txt");
    PrintStream errorOut = new PrintStream("/results/5-error.txt");
    Util.setPrintStream(orgOut, orgError, fileOut, errorOut);

    // Environment definitions
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

    try {
        Util.jobLog(TAG, Util.STATUS_START);

        // Create the HDFS paths and the Avro input formats
        Path ordersPath = new Path("hdfs://namenode:8020/mahan-data/orders.avro");
        Path lineitemPath = new Path("hdfs://namenode:8020/mahan-data/lineitem.avro");
        Path customerPath = new Path("hdfs://namenode:8020/mahan-data/customer.avro");
        Path regionPath = new Path("hdfs://namenode:8020/mahan-data/region.avro");
        Path supplierPath = new Path("hdfs://namenode:8020/mahan-data/supplier.avro");
        Path nationPath = new Path("hdfs://namenode:8020/mahan-data/nation.avro");

        AvroInputFormat<Orders> ordersAvroInputFormat = new AvroInputFormat<>(ordersPath, Orders.class);
        AvroInputFormat<Customer> customerAvroInputFormat = new AvroInputFormat<>(customerPath, Customer.class);
        AvroInputFormat<Nation> nationAvroInputFormat = new AvroInputFormat<>(nationPath, Nation.class);
        AvroInputFormat<Lineitem> lineitemAvroInputFormat = new AvroInputFormat<>(lineitemPath, Lineitem.class);
        AvroInputFormat<Supplier> supplierAvroInputFormat = new AvroInputFormat<>(supplierPath, Supplier.class);
        AvroInputFormat<Region> regionAvroInputFormat = new AvroInputFormat<>(regionPath, Region.class);

        // Create the DataSets and register them as Tables
        final DataSet<Lineitem> lineitemDataSet = env.createInput(lineitemAvroInputFormat);
        final DataSet<Customer> customerDataSet = env.createInput(customerAvroInputFormat);
        final DataSet<Nation> nationDataSet = env.createInput(nationAvroInputFormat);
        final DataSet<Orders> ordersDataSet = env.createInput(ordersAvroInputFormat);
        final DataSet<Supplier> supplierDataSet = env.createInput(supplierAvroInputFormat);
        final DataSet<Region> regionDataSet = env.createInput(regionAvroInputFormat);

        Table lineitem = tEnv.fromDataSet(lineitemDataSet);
        Table orders = tEnv.fromDataSet(ordersDataSet);
        Table nation = tEnv.fromDataSet(nationDataSet);
        Table region = tEnv.fromDataSet(regionDataSet);
        Table supplier = tEnv.fromDataSet(supplierDataSet);
        Table customer = tEnv.fromDataSet(customerDataSet);

        // Query
        Table r_temp = region.filter("R_NAME == 'MIDDLE EAST' ");
        Table o_temp = orders.filter("O_ORDERDATE >= '1994-01-01' ").filter("O_ORDERDATE < '1995-01-01' ");
        Table c_o = customer.join(o_temp).where("C_CUSTKEY == O_CUSTKEY").select("O_ORDERKEY");
        Table r_n = r_temp.join(nation).where("R_REGIONKEY == N_REGIONKEY");
        Table r_n_s = r_n.join(supplier).where("N_NATIONKEY == S_NATIONKEY");
        Table r_n_s_l = r_n_s.join(lineitem).where("S_SUPPKEY == L_SUPPKEY")
                .select("N_NAME,L_EXTENDEDPRICE,L_DISCOUNT,L_ORDERKEY");
        Table r_n_s_l_c_o = r_n_s_l.join(c_o).where("L_ORDERKEY == O_ORDERKEY");
        Table res = r_n_s_l_c_o.groupBy("N_NAME")
                .select("(L_EXTENDEDPRICE*(1-L_DISCOUNT)).sum as REVENUE")
                .orderBy("REVENUE.desc");

        // Convert the results
        DataSet<Result5> result = tEnv.toDataSet(res, Result5.class);

        // Print and save the results
        Util.log(TAG, "Result Count = " + result.count());
        result.map(p -> Result5.toTuple(p)).returns(new TypeHint<Tuple1<Float>>(){})
                .writeAsCsv("hdfs://namenode:8020/mahan-data/flink/5-res.csv", "\n", "|");
        //result.output(new AvroOutputFormat<>(new Path("/results/5.res"), Result5.class));

        Util.jobLog(TAG, Util.STATUS_DONE);
    } catch (Exception e) {
        e.printStackTrace();
        Util.jobLog(TAG, Util.STATUS_FAIELD);
    }

    env.execute("5-QUERY");
}
Here is the output. It shows that the logs were printed and saved to the expected files, but the output file was not even created:
2018/12/22 13:50:37.883-Q5-Status=Started
2018/12/22 15:23:47.791-Q5-Result Count = 5
2018/12/22 15:23:47.824-Q5-Status=Done
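After these log lines, the job keeps running for another 1 or 2 hours and then disappears. I also tried the commented-out line below the writeAsCsv call (the Avro output), with the same result. I tested both methods on the server and on my laptop with simple files and queries, and they work fine whether I save to the root folder of the server/laptop or to HDFS.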
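Quick question: if I do not need the result itself, is printing the count enough to make sure the query is executed, so that I can get its execution time? (I do not mean the timestamps I print, but the job execution time that Flink prints when the job finishes.)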