Storing ORC format via Spark in Java

Asked: 2015-08-18 17:32:57

Tags: hadoop apache-spark apache-spark-sql orc

I am using Spark 1.3.1 and I want to store data in Hive in ORC format.

The line below throws the error; it looks like ORC is not supported as a data source in Spark 1.3.1:

dataframe.save("/apps/hive/warehouse/person_orc_table_5", "orc");

java.lang.RuntimeException: Failed to load class for data source: orc
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:194)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:237)
    at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1196)
    at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1156)
    at SparkOrcHive.main(SparkOrcHive.java:62)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:577)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:174)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:197)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Spark 1.4 has..

write.format("orc").partitionBy("age").save("peoplePartitioned") 

..to store in ORC format.
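In Java, that Spark 1.4+ API would look roughly like this (a sketch, assuming dataframe is an existing DataFrame):

dataframe.write().format("orc").partitionBy("age").save("peoplePartitioned");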

Is there any way to store files in ORC format in Spark 1.3.1?

Thanks,

1 Answer:

Answer 0 (score: 1)

dataframe.select("name", "age").save("/apps/hive/warehouse/orc_table", "org.apache.spark.sql.hive.orc", SaveMode.Append);
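The difference from the failing call is that the data source is given by its fully qualified name, org.apache.spark.sql.hive.orc, rather than the orc alias that Spark 1.3.1 fails to resolve. Note that this requires the DataFrame to come from a HiveContext, since the ORC data source is part of Spark's Hive support.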

Edit:

I read a txt file from HDFS and write the data into a Hive table in ORC format. The following code works fine in Spark 1.3.1.

Java class

package com.test.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.hive.HiveContext;

/**
 * Created by ankit on 08/02/16.
 */
public class SparkOrcHiveInsert {

    public static void main(String[] args) {

        String tableName = "person_orc";
        String tablePath = "/apps/hive/warehouse/" + tableName;

        // Run locally for the demo; point the master at a real cluster in production
        SparkConf conf = new SparkConf().setAppName("ORC Demo").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // HiveContext is required: the ORC data source lives in Spark's Hive module
        HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());

        // Parse each comma-separated line of the input file into a Person bean
        JavaRDD<Person> people = sc.textFile("hdfs://~:8020/tmp/person.txt").map(
                new Function<String, Person>() {
                    public Person call(String line) throws Exception {
                        return process(line);
                    }
                });

        // Build a DataFrame from the Person JavaBean schema and append it
        // to the table path in ORC format
        DataFrame schemaPeople = hiveContext.createDataFrame(people, Person.class);
        schemaPeople.select("id", "name", "age").save(tablePath, "org.apache.spark.sql.hive.orc", SaveMode.Append);
    }

    // Expected input format: comma-separated "id,name,age"
    private static Person process(String line) {
        String[] parts = line.split(",");
        Person person = new Person();
        person.setId(Integer.parseInt(parts[0].trim()));
        person.setName(parts[1]);
        person.setAge(Integer.parseInt(parts[2].trim()));

        return person;
    }
}
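The Person class is not shown in the original answer; it needs to be a serializable JavaBean whose properties match the selected columns, since Spark infers the DataFrame schema from it by reflection. A minimal sketch:

package com.test.spark;

import java.io.Serializable;

// Minimal JavaBean so Spark can infer the DataFrame schema via reflection
public class Person implements Serializable {
    private int id;
    private String name;
    private int age;

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}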

Hive table script

create table person_orc (
  id int,
  name string,
  age int
) stored as orc tblproperties ("orc.compress"="NONE");
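Note that tablePath in the Java code ("/apps/hive/warehouse/" + tableName, i.e. /apps/hive/warehouse/person_orc) must match this table's location in the Hive warehouse, since the job writes the ORC files directly into the table directory.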

Spark submit command

~/spark/bin/spark-submit --master local --class com.test.spark.SparkOrcHiveInsert spark-orc-hive-1.0.jar
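For reference, person.txt is expected to contain comma-separated id,name,age rows, matching what process() parses; a hypothetical sample:

1,John,30
2,Jane,25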