I am new to Spark with Scala. How can I handle XML embedded in a text file alongside other pipe-delimited data? The input looks like this:
Johndoe|Newyork|USA|<root><abc1 is="tug" zip="4567" state="NY"/></root>
Smith|Jersey|USA|<root><abc2 is="tug" zip="7899" state="NJ"/></root>
The expected output is:
Johndoe|Newyork|USA|tug|4567|NY
Smith|Jersey|USA|tug|7899|NJ
Below is the code I tried, but I get this error:

Error:(21, 21) constructor cannot be instantiated to expected type;
 found   : (T1, T2, T3, T4)
 required: String
    file1.map { case (name: String, city: String, country: String, xml2) =>
import scala.xml._
import org.apache.spark.sql.types._
import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession}
import com.databricks.spark.xml._
object xml_parse {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    LogManager.getRootLogger.setLevel(Level.WARN)
    val spark = SparkSession.builder()
      .appName("parsexml")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext
    val sq = spark.sqlContext
    sc.setLogLevel("WARN")
    val file1 = sc.textFile("in\\input.txt")
    file1.map { case (name, city, country, xml2) =>
      (name,
        city,
        country,
        (XML.loadString(xml2) \\ "root" \\ "abc1").text)
    }.collect().foreach(println)
  }
}
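For reference, the compile error occurs because `sc.textFile` returns an `RDD[String]`: each element is a whole line, not a `(name, city, country, xml2)` tuple, so the tuple pattern in `case (...)` cannot match a `String`. The line has to be split on the pipe delimiter first, and the XML values live in attributes, so they must be read with the `@` selector rather than `.text` on the element. A minimal sketch of the per-line parsing (hedged: `XmlLineParse` and its field names are illustrative, and it assumes the input uses straight quotes and that the scala-xml module is on the classpath, which Spark already provides):

```scala
import scala.xml.XML

// Hypothetical helper object for illustration only.
object XmlLineParse {
  // Parse one pipe-delimited line whose fourth field is an XML snippet,
  // producing name|city|country|is|zip|state.
  def parseLine(line: String): String = {
    // Split into at most 4 fields so the XML column stays in one piece.
    val Array(name, city, country, xmlStr) = line.split("\\|", 4)
    val root  = XML.loadString(xmlStr)   // loadString returns the <root> element itself
    val child = (root \ "_").head        // first child element, e.g. <abc1 .../>
    // Attributes are selected with "@", not with element paths.
    val is    = (child \ "@is").text
    val zip   = (child \ "@zip").text
    val state = (child \ "@state").text
    Seq(name, city, country, is, zip, state).mkString("|")
  }
}
```

In the Spark job the map would then become `file1.map(XmlLineParse.parseLine).collect().foreach(println)`. Note two things about the original attempt: `XML.loadString(xml2)` already returns the `<root>` element, so selecting `\\ "root"` again finds nothing, and `.text` on a self-closing `<abc1/>` is empty because the values are attributes, not text nodes.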