I'm developing a Spark application that needs to access and update objects stored as files in HDFS. I can't figure out how to do this.
If I create a FileSystem hdfs object and use it:
boolean fileExists = hdfs.exists(new org.apache.hadoop.fs.Path(filePath));
if (fileExists){
JavaRDD<MyObject> modelRDD = sc.objectFile(filePath);
}
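For reference, the FileSystem object itself is created roughly like this (a minimal sketch; the NameNode URI is the same placeholder as in the full code below):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// build a FileSystem handle pointing at the cluster's NameNode
Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(URI.create("hdfs://<mymachineurl>:9000"), conf);
boolean fileExists = hdfs.exists(new Path(filePath));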
I get:
ERROR Executor: Exception in task 110.0 in stage 1.0 (TID 112)
java.lang.NullPointerException
This code runs on the workers, so I assume it fails because it cannot access the SparkContext. In that case, how can I access this HDFS file?
The HDFS file resides on the driver node. I could replace HDFS with Hive and store the data in Hive as a byte array, but the HiveContext cannot be accessed from the worker nodes either.
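One idea I had is to bypass Spark on the workers entirely, since the plain Hadoop FileSystem API only needs a Configuration, not the SparkContext. A minimal sketch of what I mean, assuming the object were written with plain Java serialization (note this would not read the SequenceFile format that sc.objectFile / saveAsObjectFile uses):

import java.io.ObjectInputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// inside a task, i.e. code executing on a worker node
FileSystem fs = FileSystem.get(URI.create("hdfs://<mymachineurl>:9000"), new Configuration());
MyObject object;
try (ObjectInputStream in = new ObjectInputStream(fs.open(new Path(objectPath)))) {
    object = (MyObject) in.readObject();
}

But I don't know whether this is the intended way to do it, so I'm open to other approaches.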
Adding the full code for better understanding:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.hive.HiveContext;

public class MyProgram {
    private static JavaSparkContext sc;
    private static HiveContext hiveContext;
    private static String ObjectPersistenceDir = "/metadata/objects";
    private static org.apache.hadoop.fs.FileSystem hdfs;
    private static String NameNodeURI = "hdfs://<mymachineurl>:9000";

    // create and maintain a cache of objects for every run session
    //private static HashMap<String, MyObject> cacheObjects;

    public static void main(String... args) {
        System.out.println("Starting Spark context and Hive context");
        sc = new JavaSparkContext(new SparkConf());
        hiveContext = new HiveContext(sc);

        //cacheObjects = new HashMap<>();
        //DataFrame loadedObjects = hiveContext.sql("select id, filepath from saved_objects where name = 'TEST'");
        //List<Row> rows = loadedObjects.collectAsList();
        //for (Row row : rows) {
        //    String key = (String) row.get(0);
        //    String value = (String) row.get(1);
        //    JavaRDD<MyObject> objectRDD = sc.objectFile(value);
        //    cacheObjects.put(key, objectRDD.first());
        //}

        DataFrame partitionedDF = hiveContext.sql("select * from mydata");
        final String partitionColumnName = "id";
        JavaRDD<Row> partitionedRecs = partitionedDF.repartition(partitionedDF.col(partitionColumnName)).javaRDD();

        FlatMapFunction<Iterator<Row>, MyObject> flatMapSetup = new FlatMapFunction<Iterator<Row>, MyObject>() {
            List<MyObject> lm_list = new ArrayList<>();
            MyObject object = null;

            @Override
            public List<MyObject> call(Iterator<Row> it) throws Exception {
                // for every row, create a record and update the object
                while (it.hasNext()) {
                    Row row = it.next();
                    // the partition column identifies which object this row belongs to
                    Object id = row.get(row.fieldIndex(partitionColumnName));
                    if (object == null) {
                        String objectKey = "" + id;
                        //object = cacheObjects.get(objectKey);
                        String objectPath = ObjectPersistenceDir + "/" + "TEST" + "/" + id;
                        // this runs inside a task on a worker, where the driver's
                        // SparkContext is not usable -- this line throws the NPE
                        JavaRDD<MyObject> objectRDD = sc.objectFile(objectPath);
                        object = objectRDD.collect().get(0);
                        // object not found means it was not already created
                        if (object == null) {
                            ObjectDef objectDef = new ObjectDef("TEST");
                            object = new MyObject(objectDef);
                        }
                    }
                    /*
                     * some update on object
                     */
                    //String objectKey = "" + id;
                    //cacheObjects.put(objectKey, object);
                    // Algorithm step 2.6: to save in Hive, add to list
                    lm_list.add(object);
                } // while hasNext ends
                return lm_list;
            } // call ends
        }; // FlatMapFunction ends

        //todo_nidhi put all objects in collectedObject back to hive
        List<MyObject> collectedObject = partitionedRecs.mapPartitions(flatMapSetup).collect();
    }
}
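The other direction I can think of is loading the objects on the driver (where sc is valid) before the mapPartitions call and shipping them to the tasks with a broadcast variable. A rough sketch, reusing the names from my code above; knownIds is a hypothetical list of the partition ids I would have to collect up front:

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.broadcast.Broadcast;

// driver side: sc is usable here, so load every object up front
Map<String, MyObject> objects = new HashMap<>();
for (String id : knownIds) { // hypothetical: all partition ids known in advance
    JavaRDD<MyObject> objectRDD = sc.objectFile(ObjectPersistenceDir + "/TEST/" + id);
    objects.put(id, objectRDD.first());
}
final Broadcast<Map<String, MyObject>> broadcastObjects = sc.broadcast(objects);

// worker side, inside the FlatMapFunction: no SparkContext needed
MyObject object = broadcastObjects.value().get(objectKey);

But this only works if I know all the ids before the job starts, which is essentially what the commented-out cacheObjects block was trying to do with Hive. Is there a better way?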