This is my file, which contains the following lines:
**name,bus_id,bus_timing,bus_ticket**
(yuhindmklwm00409219,958193628,0305delete,2700)
(yuhindmklwm00409219,958193628,0305delete,800)
(yuhindmklwm00409219,959262446,0219delete,62)
(yuhindmklwm00437293,752013801,0220delete,2700)
(yuhindmklwm00437293,85382,0126delete,500)
(yuhindmklwm00437293,863056514,0326delete,-2700)
(yuhindmklwm00437293,863056514,0326delete,2700)
(yuhindmklwm00437293,85258,0313delete,1000)
(yuhindmklwm00437293,85012,0311delete,1000)
(yuhindmklwm00437293,85718,0311delete,2700)
(yuhindmklwm00437293,744622574,0322delete,90)
(yuhindmklwm00437293,83704,0215delete,17)
(yuhindmklwm00437293,85253,0331delete,-2700)
(yuhindmklwm00437293,85253,0331delete,2700)
(yuhindmklwm00437293,752013801,0305delete,2700)
(yuhindmklwm00437293,33165,0315delete,1000)
(yuhindmklwm00437293,85018,0319delete,100)
(yuhindmklwm00437293,85018,0219delete,100)
(yuhindmklwm00437293,85018,0118delete,100)
(yuhindmklwm00437293,90265,0312delete,6)
(yuhindmklwm00437293,02465,0312delete,25)
(yuhindmklwm00437293,857164939,0313delete,15)
(yuhindmklwm00437293,22102,0313delete,4)
(yuhindmklwm00437293,55423,0313delete,100)
(yuhindmklwm00437293,02465,0314delete,1)
(yuhindmklwm00437293,90265,0312delete,1)
(yuhindmklwm00437293,93108,0315delete,25)
(yuhindmklwm00437293,220432304,0315delete,35)
(yuhindmklwm00437293,701211570,0315delete,35)
(yuhindmklwm00437293,28801,0315delete,10)
(yuhindmklwm00437293,93108,0211delete,3)
(yuhindmklwm00437293,93108,02)
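My final output should contain the duplicate records together with their occurrence count, the sum of their amounts, and a percentile.
name,bus_id,bus_timing, 60th percentile value of bus_ticket, sum_bus_ticket, occurrence)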
yuhindmklwm00409219,958193628,0305delete,2000, 2700, 1)
yuhindmklwm00409219,958193628,0305delete,2000, 3500, 2)
.......
.......
......
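This can be solved with lists, but can anyone suggest a more efficient data structure?
It is fine if you omit one of the aggregations (the sum or the percentile, say), but at least one aggregation should be included.
Here is my percentile function: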
scala> def percentileValue(p: Int, data: List[Int]): Int = { val sorted = data.sorted; val k = math.ceil((data.size - 1) * (p / 100.0)).toInt; sorted(k) }
percentileValue: (p: Int, data: List[Int])Int
scala> val lst=List(1,2,3,4,5,6)
lst: List[Int] = List(1, 2, 3, 4, 5, 6)
scala> percentileValue(60,lst)
res142: Int = 4
Answer 0 (score: 0)
I shortened the data to make it easier to test. Something like this?
val lili = List (List ("yuhindmklwm004092193", "9581936283", "0305delete3", 2700),
List ("yuhindmklwm004092193", "9581936283", "0305delete3", 800),
List ("yuhindmklwm004092193", "9592624463", "0219delete3", 62),
List ("yuhindmklwm004372933", "7520138013", "0220delete3", 2700),
List ("yuhindmklwm004372933", "853823", "0126delete3", 500),
List ("yuhindmklwm004372933", "8630565143", "0326delete3", -2700),
List ("yuhindmklwm004372933", "8630565143", "0326delete3", 2700),
List ("yuhindmklwm004372933", "852583", "0313delete3", 1000))
Grouping:
scala> lili.groupBy {case (list) => list(0) }.map {case (k, v) => (k, v.map (_(3)))}
res18: scala.collection.immutable.Map[Any,List[Any]] = Map(yuhindmklwm004372933 -> List(2700, 500, -2700, 2700, 1000), yuhindmklwm004092193 -> List(2700, 800, 62))
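The result type above is `Map[Any,List[Any]]` because each inner `List` mixes `String` and `Int`, so its element type is inferred as `Any`. As a minimal sketch, assuming the rows can be held as tuples instead (the `rows` and `ticketsByName` names are mine, not from the question), the same grouping keeps its types:

```scala
// Hypothetical typed rows: (name, busId, busTiming, busTicket)
val rows: List[(String, String, String, Int)] = List(
  ("yuhindmklwm004092193", "9581936283", "0305delete3", 2700),
  ("yuhindmklwm004092193", "9581936283", "0305delete3", 800),
  ("yuhindmklwm004092193", "9592624463", "0219delete3", 62),
  ("yuhindmklwm004372933", "7520138013", "0220delete3", 2700)
)

// Group by name and keep only the ticket values. The result is a
// Map[String, List[Int]], so no unchecked pattern match is needed later.
val ticketsByName: Map[String, List[Int]] =
  rows.groupBy(_._1).map { case (name, group) => name -> group.map(_._4) }
```

With this shape, `percentileValue(60, tickets)` can be applied to each value directly, without the `v: List[Int]` type pattern.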
Mapping over percentileValue:
scala> lili.groupBy {case (list) => list(0) }.map {case (k, v) => (k, v.map (_(3)))}.map {case (k, v:List[Int])=> (k, percentileValue (60, v))}
<console>:10: warning: non-variable type argument Int in type pattern List[Int] (the underlying of List[Int]) is unchecked since it is eliminated by erasure
lili.groupBy {case (list) => list(0) }.map {case (k, v) => (k, v.map (_(3)))}.map {case (k, v:List[Int])=> (k, percentileValue (60, v))}
^
res22: scala.collection.immutable.Map[Any,Int] = Map(yuhindmklwm004372933 -> 2700, yuhindmklwm004092193 -> 2700)
scala> lili.groupBy {case (list) => list(0) }.map {case (k, v) => (k, v.map (_(3)))}.map {case (k, v:List[Int])=> (k, percentileValue (10, v))}
<console>:10: warning: non-variable type argument Int in type pattern List[Int] (the underlying of List[Int]) is unchecked since it is eliminated by erasure
lili.groupBy {case (list) => list(0) }.map {case (k, v) => (k, v.map (_(3)))}.map {case (k, v:List[Int])=> (k, percentileValue (10, v))}
^
res23: scala.collection.immutable.Map[Any,Int] = Map(yuhindmklwm004372933 -> 500, yuhindmklwm004092193 -> 800)
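The calls above group by name only, while the question asks for one row per (name, bus_id, bus_timing) with a percentile, a sum and an occurrence count. Here is a rough sketch of how that could look, assuming a small case class for the records. `Ticket`, `Summary`, `summarize` and `duplicates` are hypothetical names introduced for illustration, and the percentile function is the one from the question. It also sidesteps the erasure warning, because the values stay `List[Int]` throughout.

```scala
// Hypothetical record type; the fields follow the header of the input file.
case class Ticket(name: String, busId: String, busTiming: String, busTicket: Int)

// Hypothetical result type: one row per (name, busId, busTiming) key.
case class Summary(name: String, busId: String, busTiming: String,
                   percentile: Int, sum: Int, occurrence: Int)

// Same index-based percentile as in the question.
def percentileValue(p: Int, data: List[Int]): Int = {
  val sorted = data.sorted
  val k = math.ceil((data.size - 1) * (p / 100.0)).toInt
  sorted(k)
}

// Group on the full key, then compute all three aggregates per group.
def summarize(tickets: List[Ticket], p: Int): List[Summary] =
  tickets
    .groupBy(t => (t.name, t.busId, t.busTiming))
    .toList
    .map { case ((name, busId, busTiming), group) =>
      val values = group.map(_.busTicket)
      Summary(name, busId, busTiming, percentileValue(p, values), values.sum, values.size)
    }
    .sortBy(s => (s.name, s.busId, s.busTiming))

val tickets = List(
  Ticket("yuhindmklwm004092193", "9581936283", "0305delete3", 2700),
  Ticket("yuhindmklwm004092193", "9581936283", "0305delete3", 800),
  Ticket("yuhindmklwm004092193", "9592624463", "0219delete3", 62),
  Ticket("yuhindmklwm004372933", "8630565143", "0326delete3", -2700),
  Ticket("yuhindmklwm004372933", "8630565143", "0326delete3", 2700)
)

val summaries = summarize(tickets, 60)
// Keep only the keys that occur more than once, i.e. the duplicated records.
val duplicates = summaries.filter(_.occurrence > 1)
```

This still sorts each group once for the percentile. If only the sum and the count were needed, a single pass over the data would be enough.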