DataFrame from a List&lt;String&gt; in Java

Date: 2017-04-26 12:04:46

Tags: java apache-spark spark-dataframe

  • Spark version: 1.6.2
  • Java version: 7

I have a List&lt;String&gt; of data, something like:

[[dev, engg, 10000], [karthik, engg, 20000]..]

I know the schema for this data:

name (String)
degree (String)
salary (Integer)

I tried:

JavaRDD<String> data = new JavaSparkContext(sc).parallelize(datas);
DataFrame df = sqlContext.read().json(data);
df.printSchema();
df.show(false);

Output:

root
 |-- _corrupt_record: string (nullable = true)


+-----------------------------+
|_corrupt_record              |
+-----------------------------+
|[dev, engg, 10000]           |
|[karthik, engg, 20000]       |
+-----------------------------+

because a List&lt;String&gt; is not valid JSON.

Do I need to create proper JSON first, or is there some other way to do this?
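For reference, read().json() expects one JSON object per line, which is why the non-JSON strings above end up in a _corrupt_record column. A minimal sketch of building well-formed lines from the raw data, assuming the known schema (toJsonLine is a hypothetical helper, not part of any Spark API):

```java
import java.util.Arrays;
import java.util.List;

public class ToJsonLines {
    // Hypothetical helper: turns "dev, engg, 10000" into one JSON object line
    // matching the known schema (name, degree, salary).
    static String toJsonLine(String raw) {
        String[] f = raw.split(",");
        return String.format("{\"name\":\"%s\",\"degree\":\"%s\",\"salary\":%d}",
                f[0].trim(), f[1].trim(), Integer.parseInt(f[2].trim()));
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("dev, engg, 10000", "karthik, engg, 20000");
        for (String line : data) {
            // prints {"name":"dev","degree":"engg","salary":10000} etc.
            System.out.println(toJsonLine(line));
        }
    }
}
```

As the answers below show, however, converting to JSON is not actually required.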

3 Answers:

Answer 0 (score: 8):

You can create a DataFrame from the List&lt;String&gt;, then use selectExpr with split to get the desired DataFrame.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SQLContext;

public class SparkSample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkSample").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqc = new SQLContext(jsc);
        // sample data
        List<String> data = new ArrayList<String>();
        data.add("dev, engg, 10000");
        data.add("karthik, engg, 20000");
        // single-column DataFrame ("value") holding the raw strings
        DataFrame df = sqc.createDataset(data, Encoders.STRING()).toDF();
        df.printSchema();
        df.show();
        // split each comma-separated value into the three schema columns
        DataFrame df1 = df.selectExpr(
                "split(value, ',')[0] as name",
                "split(value, ',')[1] as degree",
                "split(value, ',')[2] as salary");
        df1.printSchema();
        df1.show();
    }
}

You will get the following output:

root
 |-- value: string (nullable = true)

+--------------------+
|               value|
+--------------------+
|    dev, engg, 10000|
|karthik, engg, 20000|
+--------------------+

root
 |-- name: string (nullable = true)
 |-- degree: string (nullable = true)
 |-- salary: string (nullable = true)

+-------+------+------+
|   name|degree|salary|
+-------+------+------+
|    dev|  engg| 10000|
|karthik|  engg| 20000|
+-------+------+------+

The sample data you provided contains spaces. If you want to remove the spaces and have salary typed as integer, you can use the trim and cast functions.
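Outside Spark, the combined effect of trim and an integer cast on each field can be sketched in plain Java (parseLine is a hypothetical helper used only for illustration, not a Spark API):

```java
import java.util.Arrays;
import java.util.List;

public class SplitTrimDemo {
    // Mirrors trim(split(value, ',')[i]) plus cast(... as int) for one raw line.
    static Object[] parseLine(String line) {
        String[] parts = line.split(",");
        return new Object[] {
            parts[0].trim(),                  // name
            parts[1].trim(),                  // degree
            Integer.parseInt(parts[2].trim()) // salary as an int, not a string
        };
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("dev, engg, 10000", "karthik, engg, 20000");
        for (String line : data) {
            // prints [dev, engg, 10000] and [karthik, engg, 20000]
            System.out.println(Arrays.toString(parseLine(line)));
        }
    }
}
```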

Answer 1 (score: 1):

The task can also be done in Scala, without any JSON:

val data = List("dev, engg, 10000", "karthik, engg, 20000")
val intialRdd = sparkContext.parallelize(data)
val splittedRDD = intialRdd.map(current => {
  val array = current.split(",")
  (array(0), array(1), array(2))
})
import sqlContext.implicits._
val dataframe = splittedRDD.toDF("name", "degree", "salary")
dataframe.show()

The output is:

+-------+------+------+
|   name|degree|salary|
+-------+------+------+
|    dev|  engg| 10000|
|karthik|  engg| 20000|
+-------+------+------+

Note: (array(0), array(1), array(2)) is a Scala tuple (Tuple3).

Answer 2 (score: 1):

An alternative is to build the DataFrame from a JavaRDD&lt;Row&gt; with an explicit schema (this example then feeds it to NGram to build a bigram model):

DataFrame createNGramDataFrame(JavaRDD<String> lines) {
    JavaRDD<Row> rows = lines.map(new Function<String, Row>() {
        private static final long serialVersionUID = -4332903997027358601L;

        @Override
        public Row call(String line) throws Exception {
            // one Row with a single array-typed field
            return RowFactory.create(Arrays.asList(line.split("\\s+")));
        }
    });
    StructType schema = new StructType(new StructField[] {
        new StructField("words",
                DataTypes.createArrayType(DataTypes.StringType), false,
                Metadata.empty()) });
    // jsc is the enclosing class's JavaSparkContext
    DataFrame wordDF = new SQLContext(jsc).createDataFrame(rows, schema);
    // build a bigram language model
    NGram transformer = new NGram().setInputCol("words")
            .setOutputCol("ngrams").setN(2);
    DataFrame ngramDF = transformer.transform(wordDF);
    ngramDF.show(10, false);
    return ngramDF;
}