Question

I have a DataFrame made up of a location, a type, and a date.

d1=sc.parallelize([('a','b1','2015-01-07'), ('a','b2','2015-02-09'),
              ('c', 'b25', '2014-12-01'),('a','b2','2014-09-10'),
              ('c', 'b3','2015-02-21'),('a','b11','2015-09-12'),
              ('a','b7','2014-11-30'), ('c','b26','2014-03-09'),
              ('c', 'b30', '2015-11-28'),('a', 'b5', '2015-03-01'),
              ('c','b25','2015-11-29'),('c', 'b27','2014-01-17'),
              ('c', 'b16','2015-04-01'), ('a', 'b11','2014-01-19'),
              ('a','b7', '2015-09-29'), ('c', 'b12', '2014-08-20')]).toDF(['location',
                'type', 'date_str'])

d2=d1.withColumn('date',d1.date_str.cast('date')).drop('date_str')



+--------+----+----------+
|location|type|      date|
+--------+----+----------+
|       a|  b1|2015-01-07|
|       a|  b2|2015-02-09|
|       c| b25|2014-12-01|
|       a|  b2|2014-09-10|
|       c|  b3|2015-02-21|
|       a| b11|2015-09-12|
|       a|  b7|2014-11-30|
|       c| b26|2014-03-09|
|       c| b30|2015-11-28|
|       a|  b5|2015-03-01|
|       c| b25|2015-11-29|
|       c| b27|2014-01-17|
|       c| b16|2015-04-01|
|       a| b11|2014-01-19|
|       a|  b7|2015-09-29|
|       c| b12|2014-08-20|
+--------+----+----------+

I would like to get, for each location, the percentage of types seen in 2014 that also appear at the same location in 2015.

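In this case, location 'a' has 3 distinct types in 2014: b2, b7 and b11. In 2015, location 'a' has 5 distinct types: b1, b2, b11, b5 and b7. So of the three types present in 2014, all three are also present in 2015, i.e. 100% (3 of 3).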

For location 'c', there are four types in 2014: b25, b26, b27 and b12. In 2015 there are four types: b3, b30, b25 and b16. The share of 2014 types that were also seen in 2015 is 25% (1 of 4).

I would like to get back a DataFrame that looks like this:

+--------+----------------+
|location|percent_retained|
+--------+----------------+
|       a|               1|
|       c|             .25|
+--------+----------------+

I can do a groupBy to get the raw counts per year, but that doesn't help, because I only want the number of 2014 types that also show up in 2015.

from pyspark.sql import functions as F
d2=d2.withColumn('year', F.year(d2.date))
d2.groupBy('location', 'year').agg({'type':'count'})

I am using Spark 1.5, so pivoting the DataFrame is not available.

Answer 1

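Extract the year the way you already do, then group by location and type. You can then map over the list of years (per location, per type) and apply whatever logic you want.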

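As I understand it, each (location, type) pair can have one year or both. You could translate that into a flag such as "retained", "only 2014" or "only 2015" and do the remaining computation from there.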
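To make the flag idea concrete, here is a minimal plain-Python sketch of the logic, not Spark code. It assumes the rows are available as (location, type, date string) tuples like the ones in the question; the function name percent_retained and the flag labels are made up for illustration. On Spark the same steps would correspond to grouping by (location, type), collecting the years for each group, and then aggregating the flags per location.

```python
from collections import defaultdict

# Sample rows copied from the question: (location, type, date string).
rows = [
    ('a', 'b1', '2015-01-07'), ('a', 'b2', '2015-02-09'),
    ('c', 'b25', '2014-12-01'), ('a', 'b2', '2014-09-10'),
    ('c', 'b3', '2015-02-21'), ('a', 'b11', '2015-09-12'),
    ('a', 'b7', '2014-11-30'), ('c', 'b26', '2014-03-09'),
    ('c', 'b30', '2015-11-28'), ('a', 'b5', '2015-03-01'),
    ('c', 'b25', '2015-11-29'), ('c', 'b27', '2014-01-17'),
    ('c', 'b16', '2015-04-01'), ('a', 'b11', '2014-01-19'),
    ('a', 'b7', '2015-09-29'), ('c', 'b12', '2014-08-20'),
]

def flag(years, base_year, next_year):
    # Hypothetical labels: classify one (location, type) by the years it has.
    if base_year in years and next_year in years:
        return 'retained'
    if base_year in years:
        return 'only_base'
    if next_year in years:
        return 'only_next'
    return 'other'

def percent_retained(rows, base_year=2014, next_year=2015):
    # Step 1: group by (location, type) and collect the set of years seen.
    years_seen = defaultdict(set)
    for location, type_, date_str in rows:
        years_seen[(location, type_)].add(int(date_str[:4]))

    # Step 2: count the flags per location.
    counts = defaultdict(lambda: defaultdict(int))
    for (location, _), years in years_seen.items():
        counts[location][flag(years, base_year, next_year)] += 1

    # Step 3: retained / (all types present in the base year).
    result = {}
    for location, c in counts.items():
        base_total = c['retained'] + c['only_base']
        if base_total:  # skip locations with no base-year types
            result[location] = float(c['retained']) / base_total
    return result

print(sorted(percent_retained(rows).items()))
# [('a', 1.0), ('c', 0.25)]
```

Types that only appear in 2015 get the "only_next" flag and drop out of the ratio, which is why b1 and b5 at location 'a' do not affect the result.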
