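I am trying to count the number of consecutive positive events in each column of a pandas DataFrame. The solution DSM gave in "Counting consecutive positive value in Python array" works for a single Series.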
import pandas as pd
a = [0,1,0,1,1,0,0,0,1,1,0,1,0]
b = [0,0,0,0,1,1,0,1,1,1,0,0,0]
series = pd.Series(a)
consecutiveCount(series).values
array([0, 1, 0, 1, 2, 0, 0, 0, 1, 2, 0, 1, 0], dtype=int64)
However, when I try to do this on a DataFrame with several columns, applying it directly does not work.
Looping over each column works, but it is very slow. Is there a vectorized way to process the whole DataFrame at once?
Thanks!
Answer 0 (score: 0)
You can try the apply method. It may give you better results:
df.apply(consecutiveCount)
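For reference, here is a minimal runnable sketch of this suggestion. The question never shows the body of consecutiveCount, so the definition below is an assumption: it reuses the one-liner attributed to DSM elsewhere in this thread. Building df from the question's lists a and b is also an assumption.

```python
import pandas as pd

def consecutiveCount(y):
    # Give every run of equal values its own label.
    run_id = (y != y.shift()).cumsum()
    # Number the rows inside each run from 1, then multiply by y so that
    # zero entries stay zero (this assumes y holds 0/1 indicators).
    return y * (y.groupby(run_id).cumcount() + 1)

a = [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
b = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0]
df = pd.DataFrame({"a": a, "b": b})  # assumed construction of the DataFrame

# apply() calls consecutiveCount once per column.
result = df.apply(consecutiveCount)
print(result)
```

Because apply still makes one Python-level call per column, it is convenient rather than truly vectorized.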
Answer 1 (score: 0)
Call consecutiveCount just once, on the unstacked Series, then reshape the result back into a DataFrame.
Using DSM's consecutiveCount, which I have named c here for simplicity:
>>> c = lambda y: y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
>>> c(df.unstack()).unstack().T
    a  b
0   0  0
1   1  0
2   0  0
3   1  0
4   2  1
5   0  2
6   0  0
7   0  1
8   1  2
9   2  3
10  0  0
11  1  0
12  0  0
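One caveat with stacking every column into a single Series: the run labels are computed over the whole stacked Series, so a run of ones at the end of one column merges with a run at the start of the next. The sample data hides this because both columns end in 0. Below is a minimal sketch of a column-independent alternative. It assumes "positive" means greater than 0, and it returns the run position itself, which equals y * count only for 0/1 data. The name consecutive_count_df and the edge DataFrame are hypothetical, not from the thread.

```python
import pandas as pd

# The stacked approach from the answer above, repeated so this block runs alone.
c = lambda y: y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)

def consecutive_count_df(df):
    # 1 where the value is positive, 0 elsewhere.
    flags = (df > 0).astype(int)
    # Running number of positives seen so far, computed per column.
    running = flags.cumsum()
    # Running total as of the most recent non-positive row, carried forward.
    baseline = running.where(flags == 0).ffill().fillna(0)
    # The difference is the position inside the current run of positives.
    return (running - baseline).astype(int)

# Column "a" ends with a 1 and column "b" starts with a 1.
edge = pd.DataFrame({"a": [1, 1], "b": [1, 0]})

stacked = c(edge.unstack()).unstack().T   # the run carries over from "a" into "b"
per_column = consecutive_count_df(edge)   # each column is counted on its own

print(stacked)
print(per_column)
```

cumsum and ffill both work down each column separately, so no value can leak across a column boundary.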
Timings
# df2 is (65, 40)
df2 = pd.concat([pd.concat([df]*20, axis=1)]*5).T.reset_index(drop=True).T.reset_index(drop=True)
%timeit c(df2.unstack()).unstack().T
5.54 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df2.apply(c)
82.5 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
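The %timeit lines above need IPython. As a rough sketch, the same comparison can be run in plain Python with the standard timeit module. Building df2 with np.tile is an assumption chosen for brevity; it gives the same (65, 40) shape as the concat expression. Absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

c = lambda y: y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)

a = [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
b = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0]
df = pd.DataFrame({"a": a, "b": b})

# Tile the 13x2 frame 5 times down and 20 times across: 65 rows, 40 columns.
df2 = pd.DataFrame(np.tile(df.to_numpy(), (5, 20)))

stacked = c(df2.unstack()).unstack().T
applied = df2.apply(c)

# Both approaches agree on this data (every column ends in 0).
same = bool((stacked.to_numpy() == applied.to_numpy()).all())

t_stacked = timeit.timeit(lambda: c(df2.unstack()).unstack().T, number=10)
t_applied = timeit.timeit(lambda: df2.apply(c), number=10)
print(f"identical: {same}")
print(f"stacked: {t_stacked / 10 * 1e3:.2f} ms per call")
print(f"apply:   {t_applied / 10 * 1e3:.2f} ms per call")
```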
Answer 2 (score: 0)
Adapted from @cs95's answer:
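import numpy as np
import pandas as pd

a = pd.Series([-1, 2, 15, 3, 45, 5, 23, 0, 6, -4, -8, -5, 3,
               -9, -7, -36, -71, -2, 25, 47, -8])

def pos_neg_count(a):
    # Label each run of same-sign values (zero counts as positive).
    v = a.ge(0).ne(a.ge(0).shift()).cumsum()
    # Length of each run, in order of appearance.
    vals = v.groupby(v).count().values
    # The sign of the first element decides which column comes first.
    cols = ['pos', 'neg'] if a.iloc[0] >= 0 else ['neg', 'pos']
    try:
        result = pd.DataFrame(vals.reshape(-1, 2), columns=cols)
    except ValueError:
        # Odd number of runs: pad with a trailing 0 so the reshape works.
        vals = np.insert(vals, len(vals), 0)
        result = pd.DataFrame(vals.reshape(-1, 2), columns=cols)
    return result

pos_neg_count(a)
#    neg  pos
# 0    1    8
# 1    3    1
# 2    5    2
# 3    1    0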
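If the two-column layout and the padding zero are not needed, the same run labelling can produce one row per run with an explicit sign. This is a minimal sketch, not part of the original answer. The name run_lengths is hypothetical, and zero is treated as positive to match the ge(0) test above.

```python
import pandas as pd

def run_lengths(s):
    # True for values >= 0, False for negatives.
    is_pos = s.ge(0)
    # A new run starts wherever the sign flag differs from the previous row.
    run_id = is_pos.ne(is_pos.shift()).cumsum()
    grouped = is_pos.groupby(run_id)
    return pd.DataFrame({
        # Sign of the run, taken from its first element.
        "sign": grouped.first().map({True: "pos", False: "neg"}),
        # Number of elements in the run.
        "length": grouped.size(),
    }).reset_index(drop=True)

a = pd.Series([-1, 2, 15, 3, 45, 5, 23, 0, 6, -4, -8, -5, 3,
               -9, -7, -36, -71, -2, 25, 47, -8])
runs = run_lengths(a)
print(runs)
```

Each row is one run in order of appearance, so an odd number of runs needs no special handling.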