Question

I have a pipe-delimited text file of competitor prices. For each product, the output I want looks like this:

productId|price|saleEvent|rivalName|fetchTS 
123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43 
678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03 
777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01

I have to find the lowest price of each product across the websites. This is what I am trying:

case class Product(productId:String, price:Double, saleEvent:String, rivalName:String, fetchTS:String)

val cDF = spark.read.text("/home/prabhat/Documents/Spark/sampledata/competitor_data.txt")
val (header,values) = cDF.collect.splitAt(1)
values.foreach(x => Product(x(0).toString, x(1).toString.toDouble, 
x(2).toString, x(3).toString, x(4).toString))

Running the last line throws an exception:

java.lang.ArrayIndexOutOfBoundsException: 1
  at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:174)
  at org.apache.spark.sql.Row$class.apply(Row.scala:163)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(rows.scala:166)
  at $anonfun$1.apply(<console>:28)
  at $anonfun$1.apply(<console>:28)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  ... 49 elided

Printing the contents of values gives:

scala> values
res2: Array[org.apache.spark.sql.Row] =
Array([123|78.73|Special|VistaCart.com|2017-05-11 15:39:30 ],
[123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43 ],
[123|89.52|Sale|MarketPlace.com|2017-05-11 16:07:29 ],
[678|1348.73|Regular|VistaCart.com|2017-05-11 15:58:06 ],
[678|1348.73|Special|ShopYourWay.com|2017-05-11 15:44:22 ],
[678|1232.29|Daily|MarketPlace.com|2017-05-11 15:53:03 ],
[777|908.57|Daily|VistaCart.com|2017-05-11 15:39:01 ])

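I understand that I need to split("|"):

scala> val xy = values.foreach(x => x.toString.split("|").toSeq)
xy: Unit = ()

So after splitting, the result is of type Unit, i.e. void, and I cannot load the values into the Product case class. How can I load this DataFrame into the Product case class? I don't want to use a Dataset for now, although Datasets are type safe.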

I am using Spark 2.3 and Scala 2.11.

Answer 1

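The problem is that split takes a regular expression, which means you need to use "\\|" instead of "|". In addition, foreach has to be changed to map to actually get a return value, i.e.: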

val xy = values.map(x => x.toString.split("\\|"))
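To make the two points concrete, here is a minimal plain-Scala sketch that needs no Spark. The object name SplitDemo and the sample line are chosen for illustration only. It compares the unescaped and escaped pattern and uses map to build Product values. One caveat that follows from the printed values in the question: Row.toString wraps the line in square brackets, so in Spark it is probably safer to take the raw line with x.getString(0) than with x.toString.

```scala
object SplitDemo {
  case class Product(productId: String, price: Double, saleEvent: String,
                     rivalName: String, fetchTS: String)

  // One raw line as it appears in the text file (trailing space included).
  val line = "123|45.52|Regular|ShopYourWay.com|2017-05-11 16:09:43 "

  // Unescaped, "|" is a regex alternation of two empty patterns,
  // so the string is cut between every character.
  val wrong: Array[String] = line.split("|")

  // Escaped, the pipe is treated as a literal separator.
  val right: Array[String] = line.split("\\|")

  // map returns the transformed collection; foreach would only return Unit.
  val products: Seq[Product] = Seq(line).map { l =>
    val f = l.split("\\|").map(_.trim)
    Product(f(0), f(1).toDouble, f(2), f(3), f(4))
  }

  def main(args: Array[String]): Unit = {
    println(s"unescaped split: ${wrong.length} pieces")
    println(s"escaped split:   ${right.length} fields")
    products.foreach(println)
  }
}
```

The trim call is an addition of this sketch; it removes the trailing space that the sample lines carry after the timestamp.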

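However, a better approach is to read the data as a CSV file with | as the separator. That way the header needs no special handling, and because the column types are inferred, no conversions are necessary (here I changed fetchTS to a timestamp):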

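import java.sql.Timestamp
import spark.implicits._  // encoders required by .as[Product]; already in scope in spark-shell

case class Product(productId: String, price: Double, saleEvent: String, rivalName: String, fetchTS: Timestamp)

val df = spark.read
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // let Spark infer the column types
  .option("sep", "|")             // pipe-delimited fields
  .csv("/home/prabhat/Documents/Spark/sampledata/competitor_data.txt")
  .as[Product]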

The last line converts the DataFrame to use the Product case class. If you want to work with it as an RDD, simply append .rdd at the end.
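Since the question asks to avoid a Dataset for now, one possible alternative is to keep a plain DataFrame and build each Product by hand from its Row. The following is only a sketch under a few assumptions: the object name RowsToProducts and the helper toProducts are hypothetical, the schema is given explicitly instead of inferred so that productId stays a String, and fetchTS is kept as a String as in the question's case class.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object RowsToProducts {
  case class Product(productId: String, price: Double, saleEvent: String,
                     rivalName: String, fetchTS: String)

  // Explicit schema (an assumption of this sketch), so nothing is inferred.
  val schema: StructType = StructType(Seq(
    StructField("productId", StringType),
    StructField("price", DoubleType),
    StructField("saleEvent", StringType),
    StructField("rivalName", StringType),
    StructField("fetchTS", StringType)))

  // Build each Product from a Row by column name; no Dataset encoder is involved.
  def toProducts(df: DataFrame): RDD[Product] = df.rdd.map { r =>
    Product(
      r.getAs[String]("productId"),
      r.getAs[Double]("price"),
      r.getAs[String]("saleEvent"),
      r.getAs[String]("rivalName"),
      r.getAs[String]("fetchTS"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rows-to-products")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("sep", "|")
      .schema(schema)
      .csv("/home/prabhat/Documents/Spark/sampledata/competitor_data.txt")

    toProducts(df).collect().foreach(println)
    spark.stop()
  }
}
```

Rows with missing values are not handled here; a null price, for example, would need an explicit check before the conversion.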

Once this is done, use groupBy and agg to get the final result.
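The answer does not spell out the aggregation, so the following is one way it could look, not necessarily what the author had in mind. The object name LowestPrice and the helper cheapestPerProduct are hypothetical, and the rows from the question are inlined so the sketch does not depend on the file. It takes the minimum price per productId with groupBy and agg, then joins back to the original rows to recover the remaining columns.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.min

object LowestPrice {
  // Keep only the rows that carry the lowest price of their product.
  // If two shops tie on the lowest price, both rows are returned.
  def cheapestPerProduct(df: DataFrame): DataFrame = {
    val lowest = df.groupBy("productId").agg(min("price").as("price"))
    df.join(lowest, Seq("productId", "price"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lowest-price")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Rows taken from the question's printed values.
    val df = Seq(
      ("123", 78.73, "Special", "VistaCart.com", "2017-05-11 15:39:30"),
      ("123", 45.52, "Regular", "ShopYourWay.com", "2017-05-11 16:09:43"),
      ("123", 89.52, "Sale", "MarketPlace.com", "2017-05-11 16:07:29"),
      ("678", 1348.73, "Regular", "VistaCart.com", "2017-05-11 15:58:06"),
      ("678", 1348.73, "Special", "ShopYourWay.com", "2017-05-11 15:44:22"),
      ("678", 1232.29, "Daily", "MarketPlace.com", "2017-05-11 15:53:03"),
      ("777", 908.57, "Daily", "VistaCart.com", "2017-05-11 15:39:01")
    ).toDF("productId", "price", "saleEvent", "rivalName", "fetchTS")

    cheapestPerProduct(df).orderBy("productId").show(false)
    spark.stop()
  }
}
```

The join is used because agg(min("price")) alone returns only productId and price; a window function with row_number would be another reasonable choice if exactly one row per product is required.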

How to load data into the Product case class using a DataFrame in Spark
