Learning Spark and Scala. I have a snippet that processes an XML literal, but when I try to load the XML from a file I can't get it to work. I'm probably missing an important piece of understanding and would appreciate some help. I'm using the Cloudera VM, which has Spark 1.6 & Scala 2.10.5.
Scenario: read the XML, extract id and name, and display them as id@name.
scala> import scala.xml._
scala> val strxml = <employees>
| <employee><id>1</id><name>chris</name></employee>
| <employee><id>2</id><name>adam</name></employee>
| <employee><id>3</id><name>karl</name></employee>
| </employees>
strxml: scala.xml.Elem =
<employees>
<employee><id>1</id><name>chris</name></employee>
<employee><id>2</id><name>adam</name></employee>
<employee><id>3</id><name>karl</name></employee>
</employees>
scala> val t = strxml.flatMap(line => line \\ "employee")
t: scala.xml.NodeSeq = NodeSeq(<employee><id>1</id><name>chris</name></employee>, <employee><id>2</id><name>adam</name></employee>, <employee><id>3</id><name>karl</name></employee>)
scala> t.map(l => (l \\ "id").text + "@" + (l \\ "name").text).foreach(println)
1@chris
2@adam
3@karl
Loading it from a file (this throws an exception; what am I doing wrong here?):
scala> val filexml = sc.wholeTextFiles("file:///home/cloudera/test*")
filexml: org.apache.spark.rdd.RDD[(String, String)] = file:///home/cloudera/test* MapPartitionsRDD[66] at wholeTextFiles at <console>:30
scala> val lines = filexml.map(line => XML.loadString(line._2))
lines: org.apache.spark.rdd.RDD[scala.xml.Elem] = MapPartitionsRDD[89] at map at <console>:32
scala> val ft = lines.map(l => l \\ "employee")
ft: org.apache.spark.rdd.RDD[scala.xml.NodeSeq] = MapPartitionsRDD[99] at map at <console>:34
scala> ft.map(l => (l \\ "id").text + "@" + (l \\ "name").text).foreach(println)
Exception in task 0.0 in stage 63.0 (TID 63)
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog
File contents:
test.xml
<employees>
<employee><id>1</id><name>chris</name></employee>
<employee><id>2</id><name>adam</name></employee>
<employee><id>3</id><name>karl</name></employee>
</employees>
test2.xml
<employees>
<employee><id>4</id><name>hive</name></employee>
<employee><id>5</id><name>elixir</name></employee>
<employee><id>6</id><name>spark</name></employee>
</employees>
Answer 0 (score: 2)
Answering my own question. Two things differ from the failing attempt: the glob now matches only the .xml files, and XML.loadString is applied inside a flatMap, so each document is flattened into individual <employee> nodes before the id/name extraction.
scala> val filexml = sc.wholeTextFiles("file:///Volumes/BigData/sample_data/test*.xml")
filexml: org.apache.spark.rdd.RDD[(String, String)] = file:///Volumes/BigData/sample_data/test*.xml MapPartitionsRDD[1] at wholeTextFiles at <console>:24
scala> val lines = filexml.flatMap(line => XML.loadString(line._2) \\ "employee")
lines: org.apache.spark.rdd.RDD[scala.xml.Node] = MapPartitionsRDD[3] at flatMap at <console>:29
scala> lines.map(line => (line \\ "id").text + "@" + (line \\ "name").text).foreach(println)
1@chris
2@adam
3@karl
4@hive
5@elixir
6@spark
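One plausible culprit for the earlier "Content is not allowed in prolog" error is the broader test* glob picking up a file that isn't well-formed XML. As a defensive variant (a sketch; not part of the original answer), scala.util.Try can skip unparseable files instead of failing the whole stage:
import scala.util.Try
// Drop any file whose content fails to parse as XML instead of
// letting a SAXParseException kill the stage.
val safeLines = filexml.flatMap { case (_, content) =>
  Try(XML.loadString(content)).toOption.toSeq.flatMap(_ \\ "employee")
}
safeLines.map(n => (n \\ "id").text + "@" + (n \\ "name").text).foreach(println)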
Answer 1 (score: 0)
Here is Java code for processing XML data in Spark; adapt it to your requirements.
package packagename;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;

import com.databricks.spark.xml.XmlReader;

public class XmlreaderSpark {
    public static void main(String[] args) {
        String localxml = "file path";
        // XML element to treat as one row; for the question's data this would be "employee".
        String booksFileTag = "user";

        String warehouseLocation = "file:" + System.getProperty("user.dir") + "/spark-warehouse";
        System.out.println("warehouseLocation: " + warehouseLocation);

        SparkSession spark = SparkSession
            .builder()
            .master("local")
            .appName("Java Spark SQL Example")
            .config("spark.some.config.option", "some-value")
            .config("spark.sql.warehouse.dir", warehouseLocation)
            .config("spark.sql.crossJoin.enabled", "true") // config key only, no "set" prefix
            .enableHiveSupport()
            .getOrCreate();

        SQLContext sqlContext = new SQLContext(spark);

        // Parse the XML file into a DataFrame, one row per booksFileTag element.
        Dataset<Row> df = (new XmlReader()).withRowTag(booksFileTag).xmlFile(sqlContext, localxml);
        df.show();
    }
}
You need to add the following dependency:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.10</artifactId>
<version>0.4.0</version>
</dependency>
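For reference, a rough Scala equivalent of the same spark-xml read (a sketch assuming the Spark 1.6 shell, where sqlContext is already defined, and the question's file path; the column names id and name come from the <employee> child elements):
import org.apache.spark.sql.functions.{col, concat_ws}

// One row per <employee> element; spark-xml infers the schema (id, name).
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "employee")
  .load("file:///home/cloudera/test.xml")

df.show()
// Reproduce the id@name output from the question.
df.select(concat_ws("@", col("id"), col("name"))).show()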