scala-spark: back-to-back clicks

Date: 2015-02-18 21:05:55

Tags: scala apache-spark

I'm learning Scala and I'm new to both of these technologies.

Suppose I have a file like this:

"1421453179.157"        P0105451998  "SCREEN"   
"1421453179.157"        P0106586529  "PRESENTATION"     
"1421453179.157"        P0108481590   NULL    
"1421453179.157"        P0108481590  "SCREEN"        
"1421453179.157"        P0112397365  "FULL_SCREEN"   
"1421453179.157"        P0113994553  "FULL_SCREEN"   
"1421453179.158"        P0112360870  "DATA_INFO"    dataId:5913974361807341112
"1421453179.159"        P0112360870  "DATA_INFO"    dataId:7658923479992321112   
"1421453179.160"        P0108137271  "SCREEN"   
"1421453179.161"        P0103681986  "SCREEN"   
"1421453179.162"        P0104229251  "PRESENTATION"  

The first column is a timestamp, the second column is a user_id, and the meaning of the third column depends on the data in the fourth column.

I want to accomplish the following:

I want to find consecutive DATA_INFO records and produce the following:

P0112360870, 5913974361807341112|7658923479992321112

A verbal interpretation of that row would be: user P0112360870 clicked on 5913974361807341112|7658923479992321112, and the first click, 5913974361807341112, should come first.

I started with the following:

val data=sc.textFile("hdfs://*").map(line=> {val tks=line.split("\t",3); (tks(1),(tks(0),tks(2))) } )
val data2=data.groupBy( a=> a._1).take(1000)

But I can't figure out how to move forward from here.

3 answers:

Answer 0 (score: 1):

val data = sc.textFile("hdfs://*").map( line => line.split( "\t" ).toList )

// you probably want only those pxxx rows that carry at least some data.
val filteredData = data.filter( l => l.length > 3 )

val groupedData = filteredData.groupBy( l => l( 1 ) )

val iWantedThis = groupedData.map { case ( pxxx, iterOfList ) =>
    // every pxxx group here has at least one entry with data.
    val firstData = iterOfList.head( 3 ).stripPrefix( "dataId:" )
    // now concatenate all the other data values onto firstData.
    val datas = iterOfList.tail.foldLeft( firstData )( ( fd, l ) => fd + "|" + l( 3 ).stripPrefix( "dataId:" ) )
    // return the string with \t as separator.
    List( pxxx, datas ).mkString( "\t" )
}
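
To inspect the result on the driver (a small sketch of my own, assuming the grouped data is small enough to collect), you could print it; with the sample input above only the P0112360870 group survives the length filter:

iWantedThis.collect().foreach( println )
// expected, tab-separated: P0112360870    5913974361807341112|7658923479992321112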

Answer 1 (score: 1):

I think the way you've started is wrong. If you know what the key is, use something like the following to set it up as a proper key-value pairing:

sc.textFile("hdfs://*")
  .map(_.split("\t",3)) //Split on tabs
  .map(tks=>(tks(1),(tks(0),tks(2)))) //Create a (key, Tuple2) pairing
  .reduceByKey(
    (x,y)=>
    if(x._2 contains "DATA_INFO") (s"${x._2}|${y._2}".replace("dataId:",""), "")
    else x //Ignore duplicate non-DATA_INFO elements by keeping only the first
  )

Most notably, you'll need to handle the other cases, but this is the gist of it.

Per the request for clarification:

(s"${x._2}|${y._2}".replace("dataId:",""), "") //Using string interpolation

is the same as:
val concatenatedString = x._2 +"|"+y._2
val concatStringWithoutMetaData = concatenatedString.replace("dataId:","")
(concatStringWithoutMetaData, "") //Return the new string with an empty final column

Answer 2 (score: 1):

It's often useful to use the spark-shell (Spark's REPL) to test your ideas, especially when you're new to it.

Run the spark shell (bin/spark-shell) and create the test data set:

val input = """
"1421453179.157"        P0105451998  "SCREEN"   
"1421453179.157"        P0106586529  "PRESENTATION"     
"1421453179.157"        P0108481590   NULL    
"1421453179.157"        P0108481590  "SCREEN"        
"1421453179.157"        P0112397365  "FULL_SCREEN"   
"1421453179.157"        P0113994553  "FULL_SCREEN"   
"1421453179.158"        P0112360870  "DATA_INFO"    dataId:5913974361807341112
"1421453179.159"        P0112360870  "DATA_INFO"    dataId:7658923479992321112   
"1421453179.160"        P0108137271  "SCREEN"   
"1421453179.161"        P0103681986  "SCREEN"   
"1421453179.162"        P0104229251  "PRESENTATION""""


sc.parallelize(input.split("\n").map(_.trim)).map(_.split("\\s+")).
  filter(_.length > 3). // take only > 3 (so containing dataId)
  map(a => a(1) -> a(3).split(":")(1) ). // create a pair for each row your user -> click
  reduceByKey(_ + "|" + _). // reduce clicks per user
  collect // get it to the driver

When you run it you should see, more or less:

res0: Array[(String, String)] = Array((P0112360870,5913974361807341112|7658923479992321112))

I think that's what you're looking for.
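
One caveat (my own addition, not part of the original answer): reduceByKey makes no guarantee about the order in which the values for a key are combined, so if the requirement that the first click appear first is strict, a sketch that sorts by the timestamp column before joining could look like this:

sc.parallelize(input.split("\n").map(_.trim)).map(_.split("\\s+")).
  filter(_.length > 3).                          // keep only rows that carry a dataId
  map(a => a(1) -> (a(0), a(3).split(":")(1))).  // user -> (timestamp, click id)
  groupByKey().                                  // gather all clicks per user
  mapValues(_.toSeq.sortBy(_._1).map(_._2).mkString("|")). // order by timestamp, then join with |
  collect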