How to create a DataFrame from a String?

Asked: 2017-05-17 15:09:09

Tags: apache-spark spark-dataframe

I have a string like the one below: rows are separated by newlines and fields by spaces. The first row is the header.

col1 col2 col3 col4 col5 col6 col7 col8
val1 val2 val3 val4 val5 val6 val7 val8
val9 val10 val11 val12 val13 val14 val15 val16
val17 val18 val19 val20 val21 val22 val23 val24

How can I build a Spark DataFrame from this string in Java?

2 answers:

Answer 0 (score: 1)

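I believe @Shankar Koirala has already provided a Java solution by treating the text/string file as a CSV file (with the custom delimiter " " instead of ","). Below is the Scala equivalent of the same approach: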

val spark = org.apache.spark.sql.SparkSession.builder.
  master("local").
  appName("Spark custom CSV").
  getOrCreate

val df = spark.read.
  format("csv").
  option("header", "true").
  option("delimiter", " ").
  csv("/path/to/textfile")

df.show
+-----+-----+-----+-----+-----+-----+-----+-----+
| col1| col2| col3| col4| col5| col6| col7| col8|
+-----+-----+-----+-----+-----+-----+-----+-----+
| val1| val2| val3| val4| val5| val6| val7| val8|
| val9|val10|val11|val12|val13|val14|val15|val16|
|val17|val18|val19|val20|val21|val22|val23|val24|
+-----+-----+-----+-----+-----+-----+-----+-----+

[UPDATE] Creating a DataFrame from the string content

// needed for toDF; assumes the SparkSession `spark` created above
import spark.implicits._

// stripMargin removes the leading whitespace and '|' from each line
val s: String =
  """col1 col2 col3 col4 col5 col6 col7 col8
    |val1 val2 val3 val4 val5 val6 val7 val8
    |val9 val10 val11 val12 val13 val14 val15 val16
    |val17 val18 val19 val20 val21 val22 val23 val24""".stripMargin

// remove header line
val s2 = s.substring(s.indexOf('\n') + 1)

// create RDD: one Array[String] of fields per data line
val rdd = spark.sparkContext.parallelize(s2.split("\n").map(_.split(" ")))

// create DataFrame
val df = rdd.map{ case Array(c1, c2, c3, c4, c5, c6, c7, c8) => (c1, c2, c3, c4, c5, c6, c7, c8) }.
  toDF("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")

df.show
+-----+-----+-----+-----+-----+-----+-----+-----+
| col1| col2| col3| col4| col5| col6| col7| col8|
+-----+-----+-----+-----+-----+-----+-----+-----+
| val1| val2| val3| val4| val5| val6| val7| val8|
| val9|val10|val11|val12|val13|val14|val15|val16|
|val17|val18|val19|val20|val21|val22|val23|val24|
+-----+-----+-----+-----+-----+-----+-----+-----+
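
Since the question asks for Java, the header/row splitting steps above can be sketched in plain Java as follows. This is only an illustration with no Spark dependency: `StringTable` and `parse` are hypothetical names introduced here, and the sketch assumes fields are separated by exactly one space and that there are no blank lines between rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Spark): splits a space-delimited,
// newline-separated string into a header and a list of data rows.
final class StringTable {
    final List<String> header;
    final List<List<String>> rows;

    private StringTable(List<String> header, List<List<String>> rows) {
        this.header = header;
        this.rows = rows;
    }

    static StringTable parse(String content) {
        // Trim first so a trailing newline does not produce an empty row,
        // then split on either \n or \r\n.
        String[] lines = content.trim().split("\\r?\\n");
        // The first line is the header.
        List<String> header = Arrays.asList(lines[0].trim().split(" "));
        List<List<String>> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            List<String> fields = Arrays.asList(lines[i].trim().split(" "));
            // Reject rows whose field count does not match the header.
            if (fields.size() != header.size()) {
                throw new IllegalArgumentException(
                    "Line " + (i + 1) + " has " + fields.size()
                        + " fields, expected " + header.size());
            }
            rows.add(fields);
        }
        return new StringTable(header, rows);
    }

    public static void main(String[] args) {
        String s = "col1 col2 col3 col4 col5 col6 col7 col8\n"
            + "val1 val2 val3 val4 val5 val6 val7 val8\n"
            + "val9 val10 val11 val12 val13 val14 val15 val16\n"
            + "val17 val18 val19 val20 val21 val22 val23 val24\n";
        StringTable table = parse(s);
        System.out.println(table.header);
        System.out.println(table.rows);
    }
}
```

From there, one option would be to wrap each row with `RowFactory.create`, build a `StructType` of string columns from the header, and pass both to `SparkSession.createDataFrame(List<Row>, StructType)`.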

Answer 1 (score: 0)

You can read the CSV file with the Spark Java API as follows. Create the Spark session:

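import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
  .master("local[*]")
  .appName("Example")
  .getOrCreate();

// read the file with header enabled and " " (space) as the delimiter
// (the Spark 2.x Java API returns Dataset<Row>; it has no DataFrame class)
Dataset<Row> df = spark.read()
    .option("delimiter", " ")
    .option("header", true)
    .csv("path to file");
df.show();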
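
Because the question starts from an in-memory string rather than a file, the same reader options can also be applied to the string's lines directly. The following is a minimal sketch, not a definitive implementation: it assumes Spark 2.2 or later (where `DataFrameReader.csv(Dataset<String>)` is available) and Spark on the classpath, and `DataFrameFromString` and `fromString` are hypothetical names introduced here.

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical helper class, for illustration only.
final class DataFrameFromString {

    // Assumption: Spark 2.2+, which can parse CSV from a Dataset<String>.
    static Dataset<Row> fromString(SparkSession spark, String content) {
        // One Dataset element per line of the input string.
        Dataset<String> lines = spark.createDataset(
            Arrays.asList(content.split("\n")), Encoders.STRING());
        // Same options as when reading from a file: the first line is
        // the header, and fields are separated by a single space.
        return spark.read()
            .option("header", true)
            .option("delimiter", " ")
            .csv(lines);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("Example")
            .getOrCreate();

        String content = "col1 col2 col3 col4 col5 col6 col7 col8\n"
            + "val1 val2 val3 val4 val5 val6 val7 val8\n"
            + "val9 val10 val11 val12 val13 val14 val15 val16\n"
            + "val17 val18 val19 val20 val21 val22 val23 val24";

        fromString(spark, content).show();
        spark.stop();
    }
}
```

This avoids writing the string to a temporary file; on Spark versions before 2.2, building the rows manually and calling `createDataFrame` with an explicit schema would be the alternative.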