Reading XML input data from HDFS using Scala

Posted: 2017-03-16 07:37:50

Tags: xml scala apache-spark spark-dataframe

I am trying to parse an XML file in Scala using XML load, but the program cannot read its input from HDFS; it only reads from the local file system.

Can someone explain how to read the input data from HDFS?

The program I tried:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import scala.collection.mutable.WrappedArray
import scala.collection.immutable.HashMap

object ProcessxmlInputFiles {

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Parse XML Data").setMaster("local[*]"))
    // XML.load resolves args(0) against the local file system (as a file/URL),
    // so an hdfs:// path is not understood here
    val rawXml = xml.XML.load(args(0))
    rawXml.child.foreach { record =>
      val dateTime = StringBuilder.newBuilder
      record.child.foreach { field =>
        if ("header".equals(field.label)) {
          dateTime.append(field.child(1).attribute("dateTime").get.toString())
        }
        ...
      }
      ...
    }
    ...
    sc.stop()
  }
}

Thanks in advance!!

1 Answer:

Answer 0 (score: 0)

You can use the Databricks spark-xml library.
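The package has to be on the classpath first; a minimal sketch of the sbt dependency (the version shown is an assumption, pick one matching your Scala and Spark versions):

// build.sbt (version number is illustrative)
libraryDependencies += "com.databricks" %% "spark-xml" % "0.4.1"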

//imports
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// define the schema for the xml
val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
    StructField("_id", StringType, nullable = true),
    StructField("column1", StringType, nullable = true),
    StructField("column2", StringType, nullable = true)))

//read xml file
val df = sqlContext.read
    .format("com.databricks.spark.xml")
    .option("rowTag", "item")
    .schema(customSchema)
    .load("file.xml") // for a local file use "file:///<path to your xml>", for HDFS use "hdfs://<path to file>"

//write the result
val selectedData = df.select("column1", "_id")
selectedData.write
    .format("com.databricks.spark.xml")
    .option("rootTag", "items")
    .option("rowTag", "item")
    .save("newfile.xml")
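
If you would rather keep the scala.xml parsing from the question, Spark itself can fetch the bytes from HDFS. Below is a minimal sketch, assuming the file is small enough to parse on the driver (the hdfs path is a placeholder):

import scala.xml.XML

// wholeTextFiles understands hdfs:// URIs (unlike scala.xml.XML.load)
// and yields one (path, fileContent) record per file
val fileRDD = sc.wholeTextFiles("hdfs://<path to file>")

// pull the single file back to the driver and parse it there
val (path, content) = fileRDD.first()
val root = XML.loadString(content)
root.child.foreach(node => println(node.label))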