Given input like this:
timestamp vars
2 [1,2,3]
2 [1,2,4]
3 [1,2]
4 [1,3]
5 [1,3]
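I need a rolling count for each index. I tried expanding the arrays into a one-hot encoding ([1,2,3,5] -> [0,1,1,1,0,1]) and summing them, but that can get arbitrarily large (> 1 million entries), so I would like to keep it as a dict instead, like below. Any pointers would be appreciated.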
timestamp vars
2 {1:1, 2:1, 3:1}
2 {1:2, 2:2, 3:1, 4:1}
3 {1:3, 2:3, 3:1, 4:1}
4 {1:4, 2:3, 3:2, 4:1}
5 {1:5, 2:3, 3:3, 4:1}
Thanks!
Answer 0 (score: 0)
Sample DataFrame:
+---+------------+
| ID| arr|
+---+------------+
| 1| [0]|
| 2| [0, 1]|
| 3| [0, 1, 2]|
| 4|[0, 1, 2, 3]|
| 1| [0]|
| 1| [0]|
| 3| [0, 1, 2]|
| 0| []|
+---+------------+
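Use the following function, which relies on collections.Counter:
from collections import Counter

def arr_operation(arr):
    # Count how many times each element appears in the row's array
    return dict(Counter(arr))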
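Create a UDF for the arr_operation function as follows:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, MapType
udf_dist_count = udf(arr_operation, MapType(IntegerType(), IntegerType()))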
Then call it to create a new column:
final_df = df.withColumn("Dict",udf_dist_count("arr"))
The result will look like:
+---+------------+--------------------------------+
|ID |arr |Dict |
+---+------------+--------------------------------+
|1 |[0] |[0 -> 1] |
|2 |[0, 1] |[0 -> 1, 1 -> 1] |
|3 |[0, 1, 2] |[0 -> 1, 1 -> 1, 2 -> 1] |
|4 |[0, 1, 2, 3]|[0 -> 1, 1 -> 1, 2 -> 1, 3 -> 1]|
|1 |[0] |[0 -> 1] |
|1 |[0] |[0 -> 1] |
|3 |[0, 1, 2] |[0 -> 1, 1 -> 1, 2 -> 1] |
|0 |[] |[] |
+---+------------+--------------------------------+
The claim that collections.Counter runs slowly in a distributed environment is explained well in the answers to the question "Why is Collections.counter so slow?".
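The arr_operation function can be checked in plain Python, without a Spark session. The sketch below is only an illustration: the arrays are copied from the sample DataFrame, and the array with a repeated element is a hypothetical addition that is not part of the sample data.

```python
from collections import Counter

def arr_operation(arr):
    # Same logic as the UDF body: count occurrences of each element
    return dict(Counter(arr))

# Arrays taken from the sample DataFrame
sample_arrays = [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3], []]
per_row = [arr_operation(arr) for arr in sample_arrays]

# Hypothetical array with a repeated element (not in the sample data)
with_repeat = arr_operation([1, 1, 3])

for arr, counts in zip(sample_arrays, per_row):
    print(arr, counts)
print(with_repeat)
```

Each map covers only its own row, so a rolling count would still require summing these maps across rows in timestamp order.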
Answer 1 (score: -1)
I suggest Counter from collections:
In [1]: from collections import Counter
In [2]: count = Counter()
In [3]: count.update([1,2,4])
In [4]: count
Out[4]: Counter({1: 1, 2: 1, 4: 1})
In [5]: count.update([1,2,3])
In [6]: count
Out[6]: Counter({1: 2, 2: 2, 4: 1, 3: 1})
In [7]: count.update([2,3,5])
In [8]: count
Out[8]: Counter({2: 3, 1: 2, 3: 2, 4: 1, 5: 1})
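Applied to the rows from the question, the same idea could look like the sketch below. It assumes the rows fit in memory on a single machine and are already sorted by timestamp; the name rolling_counts is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Rows from the question: (timestamp, vars)
rows = [
    (2, [1, 2, 3]),
    (2, [1, 2, 4]),
    (3, [1, 2]),
    (4, [1, 3]),
    (5, [1, 3]),
]

def rolling_counts(rows):
    """Yield (timestamp, snapshot) pairs with the cumulative counts so far."""
    running = Counter()
    for timestamp, values in rows:
        running.update(values)
        # Copy into a plain dict so later updates do not change earlier snapshots
        yield timestamp, dict(running)

snapshots = list(rolling_counts(rows))
for timestamp, snapshot in snapshots:
    print(timestamp, snapshot)
```

Only indices that actually occur are stored, so the snapshot size tracks the number of distinct indices seen rather than the largest index value.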