I have a List<String> of data, something like:
[[dev, engg, 10000], [karthik, engg, 20000]..]
I know the schema of this data:
name (String)
degree (String)
salary (Integer)
I tried:
JavaRDD<String> data = new JavaSparkContext(sc).parallelize(datas);
DataFrame df = sqlContext.read().json(data);
df.printSchema();
df.show(false);
Output:
root
|-- _corrupt_record: string (nullable = true)
+-----------------------------+
|_corrupt_record |
+-----------------------------+
|[dev, engg, 10000] |
|[karthik, engg, 20000] |
+-----------------------------+
This happens because the List<String> is not valid JSON.
Do I need to build proper JSON, or is there another way to do this?
Answer 0 (score: 8)
You can create a DataFrame from the List<String>, and then use selectExpr and split to get the desired DataFrame, as in the program below.
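import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SQLContext;

public class SparkSample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkSample").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqc = new SQLContext(jsc);
        // sample data
        List<String> data = new ArrayList<String>();
        data.add("dev, engg, 10000");
        data.add("karthik, engg, 20000");
        // DataFrame with a single string column named "value"
        DataFrame df = sqc.createDataset(data, Encoders.STRING()).toDF();
        df.printSchema();
        df.show();
        // Convert: split each line on commas and name the resulting columns
        DataFrame df1 = df.selectExpr("split(value, ',')[0] as name", "split(value, ',')[1] as degree", "split(value, ',')[2] as salary");
        df1.printSchema();
        df1.show();
    }
}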
You will get the following output:
root
|-- value: string (nullable = true)
+--------------------+
| value|
+--------------------+
| dev, engg, 10000|
|karthik, engg, 20000|
+--------------------+
root
|-- name: string (nullable = true)
|-- degree: string (nullable = true)
|-- salary: string (nullable = true)
+-------+------+------+
| name|degree|salary|
+-------+------+------+
| dev| engg| 10000|
|karthik| engg| 20000|
+-------+------+------+
The sample data you provided contains spaces. If you want to remove the spaces and have salary typed as "integer", you can use the trim and cast functions.
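As a rough illustration of what trimming each field and casting salary to an integer amounts to for a single row, here is a plain-Java sketch with no Spark dependency. The class `EmployeeLine` and its `parse` method are hypothetical names introduced only for this example. In Spark itself the same effect would presumably be expressed with `trim(col(...))` from `org.apache.spark.sql.functions` together with `Column.cast("integer")`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Spark): mirrors split + trim + cast for one input line.
class EmployeeLine {
    final String name;
    final String degree;
    final int salary;

    EmployeeLine(String name, String degree, int salary) {
        this.name = name;
        this.degree = degree;
        this.salary = salary;
    }

    // Split on commas, strip surrounding whitespace, and parse the third field as an int.
    static EmployeeLine parse(String line) {
        String[] parts = line.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 comma-separated fields: " + line);
        }
        return new EmployeeLine(
                parts[0].trim(),
                parts[1].trim(),
                Integer.parseInt(parts[2].trim()));
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("dev, engg, 10000", "karthik, engg, 20000");
        for (String line : data) {
            EmployeeLine e = parse(line);
            System.out.println(e.name + "|" + e.degree + "|" + e.salary);
        }
    }
}
```

Note that this sketch throws on a malformed line or a non-numeric salary, which is a deliberate simplification and may differ from how Spark's cast handles unparseable values.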
Answer 1 (score: 1)
The task can be done in Scala without JSON:
val data = List("dev, engg, 10000", "karthik, engg, 20000")
val initialRdd = sparkContext.parallelize(data)
val splitRdd = initialRdd.map(current => {
  val array = current.split(",")
  (array(0), array(1), array(2))
})
import sqlContext.implicits._
val dataframe = splitRdd.toDF("name", "degree", "salary")
dataframe.show()
The output is:
+-------+------+------+
| name|degree|salary|
+-------+------+------+
| dev| engg| 10000|
|karthik| engg| 20000|
+-------+------+------+
Note: (array(0), array(1), array(2)) is a Scala Tuple.
Answer 2 (score: 1)
DataFrame createNGramDataFrame(JavaRDD<String> lines) {
    JavaRDD<Row> rows = lines.map(new Function<String, Row>() {
        private static final long serialVersionUID = -4332903997027358601L;

        @Override
        public Row call(String line) throws Exception {
            return RowFactory.create(Arrays.asList(line.split("\\s+")));
        }
    });
    StructType schema = new StructType(new StructField[] {
        new StructField("words",
            DataTypes.createArrayType(DataTypes.StringType), false,
            Metadata.empty()) });
    DataFrame wordDF = new SQLContext(jsc).createDataFrame(rows, schema);
    // build a bigram language model
    NGram transformer = new NGram().setInputCol("words")
        .setOutputCol("ngrams").setN(2);
    DataFrame ngramDF = transformer.transform(wordDF);
    ngramDF.show(10, false);
    return ngramDF;
}