Spark shell - how to retrieve rows from a dataset for a given time period, or between two given dates or years

Date: 2017-08-15 16:25:28

Tags: scala shell apache-spark

I recently started working with Spark and have been practising in the Spark shell.

I have a dataset "movies.dat" in the following format:

MovieID,Title,Genres

Sample record:

2,Jumanji (1995),Adventure|Children|Fantasy

I want to generate a list of "Horror" movies released between 1985 and 1995.

Here is my approach:

scala> val movies_data = sc.textFile("file:///home/cloudera/cs/movies.dat")

scala> val tags=movies_data.map(line=>line.split(","))

scala> tags.take(5)
res3: Array[Array[String]] = Array(Array(1, Toy Story (1995), Adventure|Animation|Children|Comedy|Fantasy), Array(2, Jumanji (1995), Adventure|Children|Fantasy), Array(3, Grumpier Old Men (1995), Comedy|Romance), Array(4, Waiting to Exhale (1995), Comedy|Drama|Romance), Array(5, Father of the Bride Part II (1995), Comedy))

scala> val horrorMovies = tags.filter(genre=>genre.contains("Horror"))

scala> horrorMovies.take(5)
res4: Array[Array[String]] = Array(Array(177, Lord of Illusions (1995), Horror), Array(220, Castle Freak (1995), Horror), Array(841, Eyes Without a Face (Les Yeux sans visage) (1959), Horror), Array(1105, Children of the Corn IV: The Gathering (1996), Horror), Array(1322, Amityville 1992: It's About Time (1992), Horror))

I want to retrieve the data using only the Spark shell. I was able to retrieve all movies of the "Horror" genre. Now, is there a way to filter those movies further, keeping only the ones with a release year between 1985 and 1995?

1 Answer:

Answer 0 (score: 0):

You can write logic to extract the year from the second element of the split line (the array) and compare it against your range, as shown below:

scala> val movies_data = sc.textFile("file:///home/cloudera/cs/movies.dat")
movies_data: org.apache.spark.rdd.RDD[String] = file:///home/cloudera/cs/movies.dat MapPartitionsRDD[5] at textFile at <console>:25

scala> val tags=movies_data.map(line=>line.split(","))
tags: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[6] at map at <console>:27

scala> val horrorMovies = tags.filter(genre => {
     | val date = genre(1).substring(genre(1).lastIndexOf("(")+1, genre(1).lastIndexOf(")")).toInt
     | date >= 1985 && date <= 1995 && genre(2).contains("Horror")
     | })
horrorMovies: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[7] at filter at <console>:29

scala> horrorMovies.take(3)
res1: Array[Array[String]] = Array(Array(177, " Lord of Illusions (1995)", " Horror"), Array(220, " Castle Freak (1995)", " Horror"), Array(1322, " Amityville 1992: It's About Time (1992)", " Horror"))
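Note that the substring extraction above assumes every title ends with a parenthesized year; a title without parentheses would throw at `toInt`. One way to sanity-check the same logic outside Spark is a plain Scala helper with a guard (the name `yearFromTitle` is mine, not from the answer):

```scala
// Pure-Scala sketch of the substring-based year extraction, runnable
// without Spark. Returns None instead of throwing on malformed titles.
def yearFromTitle(title: String): Option[Int] = {
  val open  = title.lastIndexOf("(")
  val close = title.lastIndexOf(")")
  if (open >= 0 && close > open + 1)
    scala.util.Try(title.substring(open + 1, close).trim.toInt).toOption
  else
    None
}
```

For example, `yearFromTitle("Lord of Illusions (1995)")` yields `Some(1995)`, while a title with no year yields `None`, so the filter can use `yearFromTitle(genre(1)).exists(y => y >= 1985 && y <= 1995)` without risking an exception.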

Hope this answer is helpful.

Edited:

You can also implement the above logic with a regex:

scala> val horrorMovies = tags.filter(genre => {
     | val str = """(\d+)""".r findAllIn genre(1) mkString
     | val date = if(str.length == 4) str.toInt else 0
     | date >= 1985 && date <= 1995 && genre(2).contains("Horror")
     | })
warning: there was one feature warning; re-run with -feature for details
horrorMovies: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[7] at filter at <console>:33

The rest of the code is the same as above.

Hope this answer is helpful.
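One caveat with the regex variant: `findAllIn ... mkString` concatenates every digit run in the title, so "Amityville 1992: It's About Time (1992)" becomes "19921992", the length-4 check fails, and that movie is silently dropped even though it matched the first approach. A sketch that takes only the last parenthesized four-digit group avoids this (the helper name `lastParenYear` is mine, not from the answer):

```scala
// Hedged sketch: match "(yyyy)" groups and keep the last one, so titles
// containing other numbers still parse to the correct release year.
val YearPattern = """\((\d{4})\)""".r

def lastParenYear(title: String): Option[Int] =
  YearPattern.findAllMatchIn(title).map(_.group(1).toInt).toList.lastOption
```

With this helper, the filter in the Spark shell could be written as `tags.filter(f => lastParenYear(f(1)).exists(y => y >= 1985 && y <= 1995) && f(2).contains("Horror"))`.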