I am learning Scala with Spark and I am very new to both technologies:
Suppose I have a file like this:
"1421453179.157" P0105451998 "SCREEN"
"1421453179.157" P0106586529 "PRESENTATION"
"1421453179.157" P0108481590 NULL
"1421453179.157" P0108481590 "SCREEN"
"1421453179.157" P0112397365 "FULL_SCREEN"
"1421453179.157" P0113994553 "FULL_SCREEN"
"1421453179.158" P0112360870 "DATA_INFO" dataId:5913974361807341112
"1421453179.159" P0112360870 "DATA_INFO" dataId:7658923479992321112
"1421453179.160" P0108137271 "SCREEN"
"1421453179.161" P0103681986 "SCREEN"
"1421453179.162" P0104229251 "PRESENTATION"
The first column is a timestamp, the second column is a user_id, and the meaning of the third column depends on the data in the fourth column.
I want to accomplish the following:
I want to find consecutive DATA_INFO records and generate output like this:
P0112360870, 5913974361807341112|7658923479992321112
A verbal interpretation of that line would be: user P0112360870 clicked on 5913974361807341112|7658923479992321112, where 5913974361807341112 comes first because it was the first click.
I started with the following:
val data = sc.textFile("hdfs://*").map(line => { val tks = line.split("\t", 3); (tks(1), (tks(0), tks(2))) }) // (user_id, (time, rest-of-line))
val data2 = data.groupBy(a => a._1).take(1000) // group by user_id and take a sample
but I cannot figure out how to move forward from here.
Answer 0 (score: 1)
val data = sc.textFile("hdfs://*").map(line => line.split("\t").toList)
// You probably want only those pxx rows that have at least some data.
val filteredData = data.filter(l => l.length > 3)
val groupedData = filteredData.groupBy(l => l(1))
val iWantedThis = groupedData.map { case (pxxx, iterOfLists) =>
  // Every pxxx group now has at least one entry with data.
  val firstData = iterOfLists.head(3)
  // Now concatenate all the other rows' data fields to the first one.
  val datas = iterOfLists.tail.foldLeft(firstData)((fd, l) => fd + "|" + l(3))
  // Return the string with \t as separator.
  List(pxxx, datas).mkString("\t")
}
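If you also want to strip the dataId: prefix so the output matches the numbers-only format from the question, the last step could be adjusted as in the sketch below (the stripPrefix helper is hypothetical and assumes the fourth field always starts with the literal text dataId:):
// Hypothetical helper, assuming the fourth field looks like "dataId:<number>".
def stripPrefix(field: String): String = field.replace("dataId:", "")
val iWantedThisClean = groupedData.map { case (pxxx, rows) =>
  // Join all of this user's dataIds with "|", with the prefix stripped.
  List(pxxx, rows.map(l => stripPrefix(l(3))).mkString("|")).mkString("\t")
}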
Answer 1 (score: 1)
I think the way you are starting out is wrong. Since you know the key, use something like the following to set it up as a proper key-value tuple:
sc.textFile("hdfs://*")
  .map(_.split("\t", 3))                   // Split on tabs
  .map(tks => (tks(1), (tks(0), tks(2))))  // Create a (key, Tuple2) pairing
  .reduceByKey((x, y) =>
    if (x._2 contains "DATA_INFO") (s"${x._2}|${y._2}".replace("dataId:", ""), "") // the third field, not the timestamp, carries DATA_INFO
    else x // Ignore duplicate non-DATA_INFO elements by dropping?????
  )
Most notably, you will still need to handle the other cases, but this is the right approach.
Per the request for clarification:
(s"${x._2}|${y._2}".replace("dataId:",""), "") //Using string interpolation
is the same as:
val concatenatedString = x._2 + "|" + y._2
val concatStringWithoutMetaData = concatenatedString.replace("dataId:","")
(concatStringWithoutMetaData, "") //Return the new string with an empty final column
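For completeness, one way to sidestep the other cases entirely is to filter down to the DATA_INFO rows before reducing. This is only a sketch; it assumes the file is tab-separated and that only DATA_INFO rows carry a fourth field:
sc.textFile("hdfs://*")
  .map(_.split("\t", 4))                                          // keep the dataId field as its own element
  .filter(tks => tks.length > 3 && tks(2).contains("DATA_INFO"))  // keep only DATA_INFO rows
  .map(tks => (tks(1), tks(3).replace("dataId:", "")))            // (user_id, dataId)
  .reduceByKey(_ + "|" + _)                                       // concatenate each user's clicks
Note that reduceByKey does not strictly guarantee the concatenation order across partitions, so if "first click first" is a hard requirement you may want to carry the timestamp along and sort by it instead.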
Answer 2 (score: 1)
It is often useful to test your ideas with spark-shell (it is basically a REPL for Spark), especially when you are new to it.
Run the Spark shell (bin/spark-shell) and create a test dataset:
val input = """
"1421453179.157" P0105451998 "SCREEN"
"1421453179.157" P0106586529 "PRESENTATION"
"1421453179.157" P0108481590 NULL
"1421453179.157" P0108481590 "SCREEN"
"1421453179.157" P0112397365 "FULL_SCREEN"
"1421453179.157" P0113994553 "FULL_SCREEN"
"1421453179.158" P0112360870 "DATA_INFO" dataId:5913974361807341112
"1421453179.159" P0112360870 "DATA_INFO" dataId:7658923479992321112
"1421453179.160" P0108137271 "SCREEN"
"1421453179.161" P0103681986 "SCREEN"
"1421453179.162" P0104229251 "PRESENTATION""""
sc.parallelize(input.split("\n").map(_.trim)).map(_.split("\\s+")).
  filter(_.length > 3). // take only rows with > 3 fields (so containing a dataId)
  map(a => a(1) -> a(3).split(":")(1)). // create a pair for each row: user -> click
  reduceByKey(_ + "|" + _). // reduce clicks per user
  collect // get it to the driver
When you run it, you should see, more or less:
res0: Array[(String, String)] = Array((P0112360870,5913974361807341112|7658923479992321112))
I think this is what you are looking for.
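If you would rather write the result back to HDFS than collect it to the driver, you could end the chain with a save instead; a sketch with a placeholder output path:
sc.parallelize(input.split("\n").map(_.trim)).map(_.split("\\s+")).
  filter(_.length > 3).
  map(a => a(1) -> a(3).split(":")(1)).
  reduceByKey(_ + "|" + _).
  map { case (user, ids) => s"$user, $ids" }. // format each pair as in the question's expected output
  saveAsTextFile("hdfs:///tmp/clicks_per_user") // placeholder output path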