This is my text file, which is the input to the program:
Id Title Copy
B2002010 gyh 1
D2001001 abc 12
M2003005 zxc 3
D2002003 qwe 13
M2001002 efg 1
D2001004 asd 6
D2003005 zxc 3
M2001006 wer 6
D2001006 wer 6
B2004008 sxc 10
D2002007 sdf 9
D2004008 sxc 10
The Id has the format Xyyyyrrr.
What I need to do is replace the first letter with a word.
For example:
(D2002,24) --> Dictionary,2002,24
My Spark project is in Eclipse, and I am using Maven and the Scala IDE.
package bd.spark_app
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
import org.apache.spark.sql.types.IntegerType
import scala.io.Source
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
import org.apache.log4j._
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
import scala.Array
object alla {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local")
      .setAppName("trying ")
    val sc = new SparkContext(conf)
    val x = sc.textFile("/home/hadoopusr/sampledata")
    // Key each row by the Id without its last three characters (e.g. D2001)
    // and keep the Copy column as an Int
    val converted = x.map(_.split(" ")).map(r => (r(0).dropRight(3), r(2).toInt))
    // Sum the copies per key
    val result = converted.reduceByKey(_ + _)
    sc.stop()
  }
}
The result is:
(M2001,7) (D2001,24) (M2003,3) (D2003,3) (D2002,22) (D2004,10) (B2002,1) (B2004,10)
I would like the result to be:
(Magazine,2001,7)
(Dictionary,2001,24)
(Magazine,2003,3)
(Dictionary,2003,3)
and so on.
A simple function would help.
Answer 0 (score: 2)
Does this help?
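// rdd is the RDD[String] of input lines (x in the question)
rdd.map(_.split(" "))
  .map(str => ((str.head.head match { // first letter of the Id selects the word
    case 'M' => "Magazine"
    case 'B' => "Book"
    case 'D' => "Dictionary"
    case _ => ??? // placeholder: throws NotImplementedError for any other letter
  }, str.head.drop(1).dropRight(3).toInt), // yyyy part of the Id
    str.last.toInt)) // Copy column
  .reduceByKey(_ + _) // sum copies per (word, year)
  .map(tuple => (tuple._1._1, tuple._1._2, tuple._2)) // flatten to (word, year, total)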
Sample output (verified):
(Magazine,2003,3), (Dictionary,2001,24), (Dictionary,2003,3), (Book,2002,1), (Magazine,2001,7), (Book,2004,10), (Dictionary,2002,22), (Dictionary,2004,10)
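If the file really does contain the "Id Title Copy" header line, or an Id with a letter other than M, B or D, the snippet above will throw on that row. One possible variant is to pull the parsing out into a small function that returns an Option, so bad rows can be dropped instead. This is only a sketch: the names IdExpander, kinds and parseLine are made up for illustration, and it assumes every valid Id is exactly 8 characters long (one letter, four year digits, three more characters).

```scala
import scala.util.Try

// Hypothetical helper, not part of the original question or answer.
object IdExpander {
  // Letter-to-word table, taken from the match expression in the answer.
  val kinds: Map[Char, String] =
    Map('M' -> "Magazine", 'B' -> "Book", 'D' -> "Dictionary")

  // Parses one "Id Title Copy" line into ((word, year), copies).
  // Returns None for the header, blank lines, unknown letters,
  // or fields that are not numeric.
  def parseLine(line: String): Option[((String, Int), Int)] = {
    val fields = line.trim.split("\\s+")
    // Assumption: a valid row has 3 fields and an 8-character Id (Xyyyyrrr).
    if (fields.length != 3 || fields(0).length != 8) None
    else
      for {
        kind   <- kinds.get(fields(0).head)
        year   <- Try(fields(0).substring(1, 5).toInt).toOption
        copies <- Try(fields(2).toInt).toOption
      } yield ((kind, year), copies)
  }
}
```

With the RDD from the question this could be used as `x.flatMap(line => IdExpander.parseLine(line)).reduceByKey(_ + _).map { case ((kind, year), total) => (kind, year, total) }`. Since parseLine has no Spark dependency, it can also be checked on a few sample lines before running the job.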