Question

I have data in an input text file. Each line has the following format: "PriceId,DateTime,PriceForUSA,PriceForUK,PriceForAUS".

It looks like this:

0000002,11-05-08-2016,0.92,1.68,0.81

0000003,07-05-08-2016,0.80,1.05,1.49

0000008,07-05-08-2016,1.43,1.29,1.22

The list of countries is fixed (USA, UK, AUS), and the order of the prices in each line is fixed too (PriceForUSA, PriceForUK, PriceForAUS).

I read this data from the file using the Spark context and transformed it into an RDD[List[String]]. Each List in my RDD represents one line of the input text file.

For example, the first List contains the strings

"0000002", "11-05-08-2016", "0.92", "1.68", "0.81"

the second List contains the strings

"0000003", "07-05-08-2016" , "0.80", "1.05" , "1.49"

and so on.

I also have a custom class, PriceInfo:

case class PriceInfo(priceId: String, priceDate: String, country: String, price: Double) {

  override def toString: String = s"$priceId,$priceDate,$country,$price"
}

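Converting each List[String] into a single object of this class is not hard (I can already do that), but in this case my task is to get several custom objects from each List[String].

For example, the List that contains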

"0000002", "11-05-08-2016", "0.92", "1.68", "0.81"

should be transformed into:

PriceInfo("0000002", "11-05-08-2016", "USA", 0.92)
PriceInfo("0000002", "11-05-08-2016", "UK", 1.68)
PriceInfo("0000002", "11-05-08-2016", "AUS", 0.81)

Every List[String] in my RDD[List[String]] must be "split" into several PriceInfo objects in the same way.

The result should be an RDD[PriceInfo].

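The only solution that came to my mind is to iterate over the RDD[List[String]] with foreach(), create 3 PriceInfo objects in each iteration, add all the created objects to a List[PriceInfo], and then pass this result list to SparkContext.parallelize(List...).

Something like this: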

rawPricesList.foreach(list => {
  //...create PriceInfo1 from list
  //...create PriceInfo2 from list
  //...create PriceInfo3 from list

  //...add them all to result List[PriceInfo]
})

//...sc.parallelize(List[PriceInfo]...)

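But this solution has many drawbacks.

The main one is that it will not work if we have no reference to the SparkContext, for example if we have a method getPrices() whose only parameter is an RDD[List[String]].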

def getPrices(rawPricesList: RDD[List[String]]): RDD[PriceInfo] = {
  rawPricesList.foreach(list => {
    //...create PriceInfo1 from list
    //...create PriceInfo2 from list
    //...create PriceInfo3 from list

    //...add them all to result List[PriceInfo]
  })

  //...but we can't sc.parallelize(List...) here, because there is no SparkContext sc in method parameters
}

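Besides, it seems to me that Scala offers a more elegant solution.

I tried to find similar examples in the books "Scala for the Impatient" and "Learning Spark: Lightning-Fast Big Data Analysis", but unfortunately found nothing that covers this case. I would be very grateful for any help and tips.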

Answer 1

Here is one approach:

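1. Load the text file and split each line into an Array[String] of (id, date, price1, price2, price3)
2. Use zip to transform each row into (id, date, List[(country, numericPrice)])
3. Use flatMap to flatten the (country, numericPrice) tuples of each row into rows of PriceInfo objects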

Sample code below:

case class PriceInfo(priceId: String, priceDate: String, country: String, price: Double) {
  override def toString: String = s"$priceId,$priceDate,$country,$price"
}

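// Fixed country order, matching the price columns in each line
val countryList = List("USA", "UK", "AUS")

// sc is the SparkContext (predefined in spark-shell).
// The chain below splits each line into its five fields, pairs each country
// with its price parsed as Double, then emits one PriceInfo per pair.
val rdd = sc.textFile("/path/to/textfile").
  map( _.split(",") ).
  map{ case Array(id, date, p1, p2, p3) =>
    (id, date, countryList.zip(List(p1.toDouble, p2.toDouble, p3.toDouble)))
  }.
  flatMap{ case (id, date, countryPrices) =>
    countryPrices.map( cp => PriceInfo(id, date, cp._1, cp._2) )
  }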
// rdd: org.apache.spark.rdd.RDD[PriceInfo] = ...

rdd.collect
// res1: Array[PriceInfo] = Array(
//    0000002,11-05-08-2016,USA,0.92,
//    0000002,11-05-08-2016,UK,1.68,
//    0000002,11-05-08-2016,AUS,0.81,
//    0000003,07-05-08-2016,USA,0.8,
//    0000003,07-05-08-2016,UK,1.05,
//    0000003,07-05-08-2016,AUS,1.49,
//    0000008,07-05-08-2016,USA,1.43,
//    0000008,07-05-08-2016,UK,1.29,
//    0000008,07-05-08-2016,AUS,1.22
// )
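
If the rows are already parsed, as with the RDD[List[String]] in the question, the same idea should fit the getPrices signature without any SparkContext, because flatMap is a transformation that returns a new RDD. The following is only a sketch: the names toPriceInfos and countries are hypothetical helpers, not part of the question, and dropping rows that do not have exactly five fields is an assumption about how bad input should be treated. The per-row logic is kept as a plain function so it can be tried on an ordinary Scala List first.

```scala
case class PriceInfo(priceId: String, priceDate: String, country: String, price: Double) {
  override def toString: String = s"$priceId,$priceDate,$country,$price"
}

// Fixed country order, matching the price columns in each input row.
val countries = List("USA", "UK", "AUS")

// Hypothetical helper: expands one parsed row into one PriceInfo per country.
// Assumption: rows without exactly five fields are dropped (Nil).
// A non-numeric price would still throw NumberFormatException.
def toPriceInfos(fields: List[String]): List[PriceInfo] = fields match {
  case id :: date :: prices if prices.length == countries.length =>
    countries.zip(prices).map { case (country, price) =>
      PriceInfo(id, date, country, price.toDouble)
    }
  case _ => Nil
}

// With Spark on the classpath, the method from the question could reduce to:
//   def getPrices(rawPricesList: RDD[List[String]]): RDD[PriceInfo] =
//     rawPricesList.flatMap(toPriceInfos)

// The same flatMap on a plain Scala List, to show the shape of the result.
val rows = List(
  List("0000002", "11-05-08-2016", "0.92", "1.68", "0.81"),
  List("0000003", "07-05-08-2016", "0.80", "1.05", "1.49")
)
val result = rows.flatMap(toPriceInfos)
```

Each input row yields three PriceInfo objects, so the two sample rows produce six elements. A row that returns Nil simply contributes nothing to the flattened result.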
