I am trying to join XML attributes in Scala using a comma as the separator.
scala> val fileRead = sc.textFile("source_file")
fileRead: org.apache.spark.rdd.RDD[String] = source_file MapPartitionsRDD[8] at textFile at <console>:21
scala> val strLines = fileRead.map(x => x.toString)
strLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at map at <console>:23
scala> val fltrLines = strLines.filter(_.contains("<record column1="))
fltrLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at filter at <console>:25
scala> fltrLines.take(5)
res1: Array[String] = Array("<record column1="Hello" column2="there" column3="how" column4="are you?" />", "<record column1=...."
scala> val elem = fltrLines.map{ scala.xml.XML.loadString _ }
elem: org.apache.spark.rdd.RDD[scala.xml.Elem] = MapPartitionsRDD[34] at map at <console>:27
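This is where I need to join column1, then a comma, then column2, then a comma, then column3, and so on. Ideally I would also like to be able to change the order, for example column3, column1, column2.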
scala> val attr = elem.map(_.attributes("column1"))
attr: org.apache.spark.rdd.RDD[Seq[scala.xml.Node]] = MapPartitionsRDD[35] at map at <console>:29
This is what it looks like now:
scala> attr.take(1)
res17: Array[String] = Array(Hello)
This is what I need:
scala> attr.take(1)
res17: Array[String] = Array(Hello, there, how, are you?)
Or, if I feel like it:
scala> attr.take(1)
res17: Array[String] = Array(are you?, there, Hello)
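Answer 0 (score: 0)
This will do what you want. You can get the list of attribute names and sort it, but note that this only works if all of your XML records have the same column1, column2, ... attributes.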
scala> elem.map { r =>
  // get all attribute names (columnN) and sort them
  r.attributes.map(_.key).toSeq.sorted.
    // look up each value and convert it from Seq[Node] to String
    map(k => r.attributes(k).toString) // append .toArray here for an Array instead of a List
}.first
res33: Seq[String] = List(Hello, there, how, are you?)
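One caveat with sorting: `sorted` compares the names as strings, so a record with ten or more columns would place `column10` before `column2`. Below is a minimal sketch of ordering by the numeric suffix instead, assuming every attribute name looks like `columnN`. The helper name `columnIndex` is hypothetical, and plain strings stand in for the attribute keys.

```scala
// Hypothetical helper: extracts the numeric suffix of a name such as "column12".
// Names without a trailing number sort last.
def columnIndex(name: String): Int = {
  val digits = name.reverse.takeWhile(_.isDigit).reverse
  if (digits.isEmpty) Int.MaxValue else digits.toInt
}

// Stand-in for r.attributes.map(_.key).toSeq
val names = Seq("column10", "column2", "column1", "column3")

// Plain string order: column1, column10, column2, column3
val lexicographic = names.sorted

// Numeric order: column1, column2, column3, column10
val numeric = names.sortBy(columnIndex)
```

In the snippet above, replacing `.sorted` with `.sortBy(columnIndex)` should give the numeric ordering, provided the naming assumption holds.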
Answer 1 (score: 0)
So this is how it worked for me. I set up my rows as scala.xml.Elem, just as I did before:
scala> val fileRead = sc.textFile("source_file")
fileRead: org.apache.spark.rdd.RDD[String] = source_file MapPartitionsRDD[8] at textFile at <console>:21
scala> val strLines = fileRead.map(x => x.toString)
strLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at map at <console>:23
scala> val fltrLines = strLines.filter(_.contains("<record column1="))
fltrLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at filter at <console>:25
scala> fltrLines.take(5)
res1: Array[String] = Array("<record column1="Hello" column2="there" column3="how" column4="are you?" />", "<record column1=...."
scala> val elem = fltrLines.map{ scala.xml.XML.loadString _ }
elem: org.apache.spark.rdd.RDD[scala.xml.Elem] = MapPartitionsRDD[34] at map at <console>:27
But this time, instead of using the attributes("AttributeName") method, I used attributes.asAttrMap, which gives me a Map[String,String] = Map(Key1 -> Value1, Key2 -> Value2, ....):
scala> val mappedElem = elem.map(_.attributes.asAttrMap)
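Then I specified my own column order. This way, if a column (that is, an attribute in the XML) does not exist, the data simply shows null. I can change null to anything I like: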
val myVals = mappedElem.map { x => x.getOrElse("column3", null) + ", " + x.getOrElse("column1", null) }
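To get the columns in any order I like, all I have to do is change the order in which I look them up in this call when converting the XML file to a comma-separated file.
The output is: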
how, Hello
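For more than a couple of columns, chaining `getOrElse` calls with `+` gets unwieldy. A minimal sketch of the same idea with the column order passed in as a list follows. The helper name `joinColumns` and its parameters are hypothetical, and a plain `Map` stands in for one record's `attributes.asAttrMap` result.

```scala
// Hypothetical helper: picks the attributes named in `order` and joins them with `sep`.
// Attributes missing from the record are replaced by `default`.
def joinColumns(attrs: Map[String, String],
                order: Seq[String],
                sep: String = ", ",
                default: String = ""): String =
  order.map(name => attrs.getOrElse(name, default)).mkString(sep)

// Stand-in for one record's attributes.asAttrMap result
val record = Map(
  "column1" -> "Hello",
  "column2" -> "there",
  "column3" -> "how",
  "column4" -> "are you?"
)

// Same order as the example above
val reordered = joinColumns(record, Seq("column3", "column1"))

// A column that does not exist falls back to the chosen default
val withMissing = joinColumns(record, Seq("column4", "column9", "column1"), default = "N/A")
```

On the RDD this would presumably be used as `mappedElem.map(m => joinColumns(m, Seq("column3", "column1")))`. Note that values containing the separator themselves are not quoted or escaped here.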