I recently started using Spark and am practicing in the Spark shell.
I have a dataset, "movies.dat", in the following format:
MovieID,Title,Genres
Sample record:
2,Jumanji (1995),Adventure|Children|Fantasy
I want to produce a list of "Horror" movies released between 1985 and 1995.
Here is my approach.
scala> val movies_data = sc.textFile("file:///home/cloudera/cs/movies.dat")
scala> val tags=movies_data.map(line=>line.split(","))
scala> tags.take(5)
res3: Array[Array[String]] = Array(Array(1, Toy Story (1995), Adventure|Animation|Children|Comedy|Fantasy), Array(2, Jumanji (1995), Adventure|Children|Fantasy), Array(3, Grumpier Old Men (1995), Comedy|Romance), Array(4, Waiting to Exhale (1995), Comedy|Drama|Romance), Array(5, Father of the Bride Part II (1995), Comedy))
scala> val horrorMovies = tags.filter(genre=>genre.contains("Horror"))
scala> horrorMovies.take(5)
res4: Array[Array[String]] = Array(Array(177, Lord of Illusions (1995), Horror), Array(220, Castle Freak (1995), Horror), Array(841, Eyes Without a Face (Les Yeux sans visage) (1959), Horror), Array(1105, Children of the Corn IV: The Gathering (1996), Horror), Array(1322, Amityville 1992: It's About Time (1992), Horror))
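I want to retrieve the data using only the Spark shell. I can already retrieve all movies in the "Horror" genre. Is there a way to narrow these down to only the movies released between 1985 and 1995?
Answer 0 (score: 0)
You can extract the year from the second element of the split line (the array) and compare it against your range, as follows: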
scala> val movies_data = sc.textFile("file:///home/cloudera/cs/movies.dat")
movies_data: org.apache.spark.rdd.RDD[String] = file:///home/cloudera/cs/movies.dat MapPartitionsRDD[5] at textFile at <console>:25
scala> val tags=movies_data.map(line=>line.split(","))
tags: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[6] at map at <console>:27
scala> val horrorMovies = tags.filter(genre => {
| val date = genre(1).substring(genre(1).lastIndexOf("(")+1, genre(1).lastIndexOf(")")).toInt
| date >= 1985 && date <= 1995 && genre(2).contains("Horror")
| })
horrorMovies: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[7] at filter at <console>:29
scala> horrorMovies.take(3)
res1: Array[Array[String]] = Array(Array(177, " Lord of Illusions (1995)", " Horror"), Array(220, " Castle Freak (1995)", " Horror"), Array(1322, " Amityville 1992: It's About Time (1992)", " Horror"))
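The substring-plus-toInt step above throws an exception if a title does not end in "(year)", for example when a comma inside the title shifts the fields. Below is a minimal sketch of a more defensive variant, assuming every valid title ends in a four-digit year in parentheses. The names `YearAtEnd`, `releaseYear`, and `isHorrorBetween` are hypothetical and not part of the original answer.

```scala
// Hypothetical helper (an assumption, not from the original answer):
// matches a four-digit year in parentheses at the end of the title.
val YearAtEnd = """\((\d{4})\)\s*$""".r

// Returns Some(year) when the title ends in "(yyyy)", otherwise None.
def releaseYear(title: String): Option[Int] =
  YearAtEnd.findFirstMatchIn(title).map(_.group(1).toInt)

// Predicate over one split record: Array(movieId, title, genres).
// Records with fewer than three fields or without a year are rejected
// instead of throwing.
def isHorrorBetween(fields: Array[String], from: Int, to: Int): Boolean =
  fields.length >= 3 &&
    fields(2).contains("Horror") &&
    releaseYear(fields(1)).exists(y => y >= from && y <= to)

// In the Spark shell this could stand in for the inline filter body:
//   val horrorMovies = tags.filter(f => isHorrorBetween(f, 1985, 1995))
```

Returning an Option keeps malformed records out of the result without aborting the whole job on the first bad line.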
I hope the answer is helpful.
Edited
You can also implement the logic above with a regex:
scala> val horrorMovies = tags.filter(genre => {
| // Capture only four-digit years in parentheses, e.g. "(1992)", and use the last one,
| // so digits elsewhere in the title ("Amityville 1992: ...") are not mixed in.
| val years = """\((\d{4})\)""".r.findAllMatchIn(genre(1)).map(_.group(1).toInt).toList
| val date = years.lastOption.getOrElse(0)
| date >= 1985 && date <= 1995 && genre(2).contains("Horror")
| })
horrorMovies: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[7] at filter at <console>:33
The rest of the code is the same as above.
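Both variants rely on line.split(","), which only works if the title itself contains no commas. The sample records shown satisfy that, but it is an assumption about the rest of movies.dat. As a rough sketch, a line can instead be split on its first and last comma only, assuming the genres field never contains a comma. The `Movie` case class and `parseMovie` function are hypothetical names introduced for illustration.

```scala
import scala.util.Try

// Hypothetical record type; the field names are illustrative only.
case class Movie(id: Int, title: String, genres: Set[String])

// Split on the first and last comma so that commas inside the title survive.
// Returns None for lines that do not have the MovieID,Title,Genres shape.
def parseMovie(line: String): Option[Movie] = {
  val first = line.indexOf(',')
  val last  = line.lastIndexOf(',')
  if (first < 0 || last <= first) None
  else Try {
    Movie(
      id     = line.substring(0, first).trim.toInt,
      title  = line.substring(first + 1, last).trim,
      genres = line.substring(last + 1).trim.split('|').map(_.trim).toSet
    )
  }.toOption
}

// In the Spark shell this could be applied roughly as:
//   val movies = movies_data.flatMap(line => parseMovie(line))
//   val horror = movies.filter(m => m.genres.contains("Horror"))
```

Keeping the genres as a Set makes the check an exact match on "Horror" rather than a substring search on the raw field.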