Sorry, I am new to Spark, Scala and Hadoop. I have found plenty of links about this weather-record data, but they all use Hadoop MapReduce in Java, whereas we need to run it in Spark and build a DataFrame.
I am trying to create a DataFrame of the record fields and run SQL statements on it to retrieve the maximum, minimum and average temperature for each month.
Explanation for the line of record
This is the DataFrame I have at the moment (I use reflection to infer the schema implicitly):
scala> case class Weather(station:String, year:Int, month:Int, date:Int, hour:Int, temperature:Double)
scala> val test = input.map(_.split("")).map(p => Weather(p(0),p(1).toInt,p(2).toInt,p(3).toInt,p(4).toInt,p(5).toDouble ))
This works without errors, but I only get single-digit results:
scala> test.first()
res0: Weather = Weather(0,0,3,5,0,2.0)
The problem now is finding a way to slice the dataset by the character range each field occupies in a record line. For example, the station is String(4,10), but I can only pass a single Integer index in my map. Is there any way to take a range in .map(key => value), e.g. .map(p => Weather(p(4-9),p(10-12))), or is there a way to do the split with a regex?
Edit (I think I explained my problem the wrong way):
TL;DR: given a dataset with many data records like the ones above, I need a way to split each line into its fields and extract, for every record, the information shown in the picture.
Explanation for the line of record
a full list of the dataset can be seen here
What I tried after @Yaron's answer:
case class Weather(station:String, year:Int, month:Int, date:Int, hour:Int, temperature:Double)
val splitdata = input.map(_.split(" "))
scala> val test = splitdata.map(p => Weather(p.substring(4,10),p.substring(15,19).toInt,p.substring(19,21).toInt,p.substring(21,23).toInt,p.substring(23,27).toInt,p(87,92).toDouble ))
Answer 0 (score: 1)
An example of how to combine substring, the case class and map:
I prepared a test file, /tmp/inp.txt, containing a sample of 3 input lines:
0035029070999991902010413004+64333+023450FM-12+000599999V0201601N015919999999N0000001N9-00941+99999098181ADDGF108991999999999999999999MW1721
0035029072999991902010413004+64333+023450FM-12+000599999V0201601N015919999999N0000001N9-00941+99999098181ADDGF108991999999999999999999MW1723
0035029075999991902010413004+64333+023450FM-12+000599999V0201601N015919999999N0000001N9-00941+99999098181ADDGF108991999999999999999999MW1728
I executed the following commands (the collect command is only used for demonstration and should not be executed in production-grade programs).
Reading the data from the local file:
scala> val rdd = spark.read.textFile("file:////tmp/inp.txt")
rdd: org.apache.spark.sql.Dataset[String] = [value: string]
Displaying the contents of the rdd:
scala> rdd.collect
res1: Array[String] = Array(0035029070999991902010413004+64333+023450FM-12+000599999V0201601N015919999999N0000001N9-00941+99999098181ADDGF108991999999999999999999MW1721, 0035029072999991902010413004+64333+023450FM-12+000599999V0201601N015919999999N0000001N9-00941+99999098181ADDGF108991999999999999999999MW1723, 0035029075999991902010413004+64333+023450FM-12+000599999V0201601N015919999999N0000001N9-00941+99999098181ADDGF108991999999999999999999MW1728)
The case class:
scala> case class Weather(station:String, year:Int, month:Int, date:Int, hour:Int, temperature:Double)
defined class Weather
The map:
scala> val rdd2 = rdd.map(p => Weather(p.substring(4,10),p.substring(15,19).toInt,p.substring(19,21).toInt,p.substring(21,23).toInt,p.substring(23,27).toInt,p.substring(87,92).toDouble ))
rdd2: org.apache.spark.sql.Dataset[Weather] = [station: string, year: int ... 4 more fields]
The contents of rdd2:
scala> rdd2.collect
res2: Array[Weather] = Array(Weather(029070,1902,1,4,1300,-94.0), Weather(029072,1902,1,4,1300,-94.0), Weather(029075,1902,1,4,1300,-94.0))
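With rdd2 in place, the maximum, minimum and average temperature per month that the question asks for could be computed with Spark SQL. Below is a minimal sketch, assuming the spark session from the shell above; the view name weather and the variable monthlyStats are mine:

// Register the Dataset[Weather] as a temporary SQL view and aggregate per month
rdd2.createOrReplaceTempView("weather")

val monthlyStats = spark.sql("""
  SELECT year, month,
         MAX(temperature) AS max_temp,
         MIN(temperature) AS min_temp,
         AVG(temperature) AS avg_temp
  FROM weather
  GROUP BY year, month
  ORDER BY year, month
""")

monthlyStats.show()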
You might want to use the String slice or substring methods:
scala> val mystr="0035029070999991902010413004+64333+023450FM-12+000599999V0201601N015919999999N0000001N9-00941+99999098181ADDGF108991999999999999999999MW1721"
mystr: String = 0035029070999991902010413004+64333+023450FM-12+000599999V0201601N015919999999N0000001N9-00941+99999098181ADDGF108991999999999999999999MW1721
scala> mystr.slice(3,5)
res155: String = 50
scala> mystr.slice(3,8)
res156: String = 50290