I'm developing a Spark application that needs to access and update objects stored as files in HDFS. I can't figure out how to do this.
If I create a FileSystem hdfs object and use it:
boolean fileExists = hdfs.exists(new org.apache.hadoop.fs.Path(filePath));
if (fileExists){
JavaRDD<MyObject> modelRDD = sc.objectFile(filePath);
}
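For reference, the FileSystem object itself is created roughly like this (a minimal sketch; the NameNode URI is the same placeholder as in the full code below):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// build a FileSystem handle pointing at the cluster's NameNode
Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(URI.create("hdfs://<mymachineurl>:9000"), conf);
boolean fileExists = hdfs.exists(new Path(filePath));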
I get:
ERROR Executor: Exception in task 110.0 in stage 1.0 (TID 112)
java.lang.NullPointerException
This code runs on the workers, so I assume it fails because it cannot access the SparkContext. In that case, how can I access this HDFS file?
The HDFS file resides on the driver node. I could replace HDFS with Hive and store the data in Hive as a byte array, but the HiveContext cannot be accessed from the worker nodes either.
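One idea I had is to bypass Spark on the workers entirely, since the plain Hadoop FileSystem API only needs a Configuration, not the SparkContext. A minimal sketch of what I mean, assuming the object were written with plain Java serialization (note this would not read the SequenceFile format that sc.objectFile / saveAsObjectFile uses):

import java.io.ObjectInputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// inside a task, i.e. code executing on a worker node
FileSystem fs = FileSystem.get(URI.create("hdfs://<mymachineurl>:9000"), new Configuration());
MyObject object;
try (ObjectInputStream in = new ObjectInputStream(fs.open(new Path(objectPath)))) {
    object = (MyObject) in.readObject();
}

But I don't know whether this is the intended way to do it, so I'm open to other approaches.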
Adding the full code for better understanding:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.hive.HiveContext;

public class MyProgram {
    private static JavaSparkContext sc;
    private static HiveContext hiveContext;
    private static String ObjectPersistenceDir = "/metadata/objects";
    private static org.apache.hadoop.fs.FileSystem hdfs;
    private static String NameNodeURI = "hdfs://<mymachineurl>:9000";

    // create and maintain a cache of objects for every run session
    //private static HashMap<String, MyObject> cacheObjects;

    public static void main(String... args) {
        System.out.println("Starting Spark context and Hive context");
        sc = new JavaSparkContext(new SparkConf());
        hiveContext = new HiveContext(sc);

        //cacheObjects = new HashMap<>();
        //DataFrame loadedObjects = hiveContext.sql("select id, filepath from saved_objects where name = 'TEST'");
        //List<Row> rows = loadedObjects.collectAsList();
        //for (Row row : rows) {
        //    String key = (String) row.get(0);
        //    String value = (String) row.get(1);
        //    JavaRDD<MyObject> objectRDD = sc.objectFile(value);
        //    cacheObjects.put(key, objectRDD.first());
        //}

        DataFrame partitionedDF = hiveContext.sql("select * from mydata");
        final String partitionColumnName = "id";
        JavaRDD<Row> partitionedRecs = partitionedDF.repartition(partitionedDF.col(partitionColumnName)).javaRDD();

        FlatMapFunction<Iterator<Row>, MyObject> flatMapSetup = new FlatMapFunction<Iterator<Row>, MyObject>() {
            List<MyObject> lm_list = new ArrayList<>();
            MyObject object = null;

            @Override
            public List<MyObject> call(Iterator<Row> it) throws Exception {
                // for every row, create a record and update the object
                while (it.hasNext()) {
                    Row row = it.next();
                    // the partition column identifies which object this row belongs to
                    Object id = row.get(row.fieldIndex(partitionColumnName));
                    if (object == null) {
                        String objectKey = "" + id;
                        //object = cacheObjects.get(objectKey);
                        String objectPath = ObjectPersistenceDir + "/" + "TEST" + "/" + id;
                        // this runs inside a task on a worker, where the driver's
                        // SparkContext is not usable -- this line throws the NPE
                        JavaRDD<MyObject> objectRDD = sc.objectFile(objectPath);
                        object = objectRDD.collect().get(0);
                        // object not found means it was not already created
                        if (object == null) {
                            ObjectDef objectDef = new ObjectDef("TEST");
                            object = new MyObject(objectDef);
                        }
                    }
                    /*
                     * some update on object
                     */
                    //String objectKey = "" + id;
                    //cacheObjects.put(objectKey, object);
                    // Algorithm step 2.6: to save in Hive, add to list
                    lm_list.add(object);
                } // while hasNext ends
                return lm_list;
            } // call ends
        }; // FlatMapFunction ends

        //todo_nidhi put all objects in collectedObject back to hive
        List<MyObject> collectedObject = partitionedRecs.mapPartitions(flatMapSetup).collect();
    }
}
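The other direction I can think of is loading the objects on the driver (where sc is valid) before the mapPartitions call and shipping them to the tasks with a broadcast variable. A rough sketch, reusing the names from my code above; knownIds is a hypothetical list of the partition ids I would have to collect up front:

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.broadcast.Broadcast;

// driver side: sc is usable here, so load every object up front
Map<String, MyObject> objects = new HashMap<>();
for (String id : knownIds) { // hypothetical: all partition ids known in advance
    JavaRDD<MyObject> objectRDD = sc.objectFile(ObjectPersistenceDir + "/TEST/" + id);
    objects.put(id, objectRDD.first());
}
final Broadcast<Map<String, MyObject>> broadcastObjects = sc.broadcast(objects);

// worker side, inside the FlatMapFunction: no SparkContext needed
MyObject object = broadcastObjects.value().get(objectKey);

But this only works if I know all the ids before the job starts, which is essentially what the commented-out cacheObjects block was trying to do with Hive. Is there a better way?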