Question

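import pandas as pd

x = pd.DataFrame(index=pd.date_range(start="2017-1-1", end="2017-1-13"),
                 columns="a b c".split())
# .ix was removed in pandas 1.0; positional .iloc selects the same rows
x.iloc[0:2, 0] = 1    # column a
x.iloc[5:10, 0] = 1
x.iloc[9:12, 1] = 1   # column b
x.iloc[1:3, 2] = 1    # column c
x.iloc[5, 2] = 1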

            a   b   c
2017-01-01  1   NaN NaN
2017-01-02  1   NaN 1
2017-01-03  NaN NaN 1
2017-01-04  NaN NaN NaN
2017-01-05  NaN NaN NaN
2017-01-06  1   NaN 1
2017-01-07  1   NaN NaN
2017-01-08  1   NaN NaN
2017-01-09  1   NaN NaN
2017-01-10  1   1   NaN
2017-01-11  NaN 1   NaN
2017-01-12  NaN 1   NaN
2017-01-13  NaN NaN NaN

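Given the DataFrame x above, I want to return the average number of occurrences of 1s per group for each of a, b and c. The average for each column is taken over the number of blocks that contain consecutive 1s.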

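For example, column a yields the average of 2 and 5, which is 3.5. We divide by 2 because there are 2 consecutive 1s from Jan 1 to Jan 2 and then 5 consecutive 1s from Jan 6 to Jan 10, for a total of 2 blocks of 1s. Similarly, column b gives 3, because there is only one sequence of consecutive 1s, from Jan 10 to Jan 12. Finally, column c gives the average of 2 and 1, which is 1.5.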

Expected output for the toy example:

a    b  c
3.5  3  1.5
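To make the definition concrete, here is a minimal sketch in plain Python, independent of pandas. The helper names run_lengths and mean_run_length and the hand-typed column lists are illustrative assumptions, not part of the question.

```python
from itertools import groupby

def run_lengths(values):
    """Return the length of every block of consecutive 1s in `values`."""
    return [
        sum(1 for _ in block)
        for is_one, block in groupby(values, key=lambda v: v == 1)
        if is_one
    ]

def mean_run_length(values):
    """Average block length, or NaN when the column holds no 1s at all."""
    lengths = run_lengths(values)
    return sum(lengths) / len(lengths) if lengths else float("nan")

nan = float("nan")
col_a = [1, 1, nan, nan, nan, 1, 1, 1, 1, 1, nan, nan, nan]
col_b = [nan] * 9 + [1, 1, 1] + [nan]
col_c = [nan, 1, 1, nan, nan, 1] + [nan] * 7

print(run_lengths(col_a))      # [2, 5]
print(mean_run_length(col_a))  # 3.5
print(mean_run_length(col_b))  # 3.0
print(mean_run_length(col_c))  # 1.5
```

Any vectorized answer should reproduce these three numbers on the toy data.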

Answer 1

Use mask + apply with value_counts, and finally take the mean of the counts:

x.eq(1)\
 .ne(x.eq(1).shift())\
 .cumsum(0)\
 .mask(x.ne(1))\
 .apply(pd.Series.value_counts)\
 .mean(0)

a    3.5
b    3.0
c    1.5
dtype: float64

Details

First, label every run of consecutive equal values in the DataFrame:

i = x.eq(1).ne(x.eq(1).shift()).cumsum(0)
i

            a  b  c
2017-01-01  1  1  1
2017-01-02  1  1  2
2017-01-03  2  1  2
2017-01-04  2  1  3
2017-01-05  2  1  3
2017-01-06  3  1  4
2017-01-07  3  1  5
2017-01-08  3  1  5
2017-01-09  3  1  5
2017-01-10  3  2  5
2017-01-11  4  2  5
2017-01-12  4  2  5
2017-01-13  4  3  5

Now keep only the group labels at positions where x is 1:

j = i.mask(x.ne(1))
j

              a    b    c
2017-01-01  1.0  NaN  NaN
2017-01-02  1.0  NaN  2.0
2017-01-03  NaN  NaN  2.0
2017-01-04  NaN  NaN  NaN
2017-01-05  NaN  NaN  NaN
2017-01-06  3.0  NaN  4.0
2017-01-07  3.0  NaN  NaN
2017-01-08  3.0  NaN  NaN
2017-01-09  3.0  NaN  NaN
2017-01-10  3.0  2.0  NaN
2017-01-11  NaN  2.0  NaN
2017-01-12  NaN  2.0  NaN
2017-01-13  NaN  NaN  NaN

Now apply value_counts to each column:

k = j.apply(pd.Series.value_counts)
k


       a    b    c
1.0  2.0  NaN  NaN
2.0  NaN  3.0  2.0
3.0  5.0  NaN  NaN
4.0  NaN  NaN  1.0

Then take the column-wise mean:

k.mean(0)

a    3.5
b    3.0
c    1.5
dtype: float64

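As a handy note, if you want the mean count over runs of more than n consecutive 1s only (say, n = 1 here), filter on the counts stored in k (its index only holds the group labels):

k[k > 1].mean(0)

a    3.5
b    3.0
c    2.0
dtype: float64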
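The steps above can be bundled into one helper. This is a sketch under some assumptions: the function name mean_run_length and its min_len parameter are hypothetical, and the toy frame is rebuilt here with float dtype rather than the object dtype produced in the question.

```python
import numpy as np
import pandas as pd

def mean_run_length(df, min_len=1):
    """Mean length of runs of 1s per column, ignoring runs shorter than min_len."""
    is_one = df.eq(1)
    # A new label starts wherever the True/False state changes.
    labels = is_one.ne(is_one.shift()).cumsum()
    # Keep labels only where the value is 1, then count the rows per label.
    lengths = labels.where(is_one).apply(pd.Series.value_counts)
    # Filter on the counts themselves (run lengths), not on the labels.
    return lengths[lengths >= min_len].mean()

# Toy data from the question, as a float frame (assumption).
x = pd.DataFrame(np.nan,
                 index=pd.date_range("2017-01-01", "2017-01-13"),
                 columns=list("abc"))
x.iloc[0:2, 0] = 1
x.iloc[5:10, 0] = 1
x.iloc[9:12, 1] = 1
x.iloc[1:3, 2] = 1
x.iloc[5, 2] = 1

all_runs = mean_run_length(x)              # a 3.5, b 3.0, c 1.5
long_runs = mean_run_length(x, min_len=2)  # a 3.5, b 3.0, c 2.0
print(all_runs)
print(long_runs)
```

With min_len=2, the single-row run in column c is dropped, so its mean rises from 1.5 to 2.0 while a and b are unchanged.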

Answer 2

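Try this (min_count=1 is needed on pandas 0.22 and later, where a group containing only NaN would otherwise sum to 0 and drag the mean down; astype(float) just makes sure the object-dtype columns from the question are summed numerically):

x.apply(lambda s: s.astype(float).groupby(s.ne(1).cumsum()).sum(min_count=1).mean())

Output: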

a    3.5
b    3.0
c    1.5
dtype: float64

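This applies the lambda function to each column of the DataFrame. The lambda groups the rows by the running count of non-1 values, so each block of 1s lands in its own group (together with the non-1 value just before it), counts the 1s in each group with sum(), and then takes the average with mean().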
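To see what the grouping key looks like, here is a sketch on column a alone. It assumes a recent pandas, where the sum of an all-NaN group is 0 unless min_count=1 is passed; the variable names key and block_sums are illustrative.

```python
import numpy as np
import pandas as pd

# Column a from the question, as a float Series.
s = pd.Series([1, 1, np.nan, np.nan, np.nan, 1, 1, 1, 1, 1,
               np.nan, np.nan, np.nan])

# Every non-1 value bumps the counter, so a run of 1s shares the key of the
# non-1 value just before it (or key 0 at the start of the column).
key = s.ne(1).cumsum()

# min_count=1 makes groups that hold only NaN sum to NaN instead of 0,
# so they drop out of the mean.
block_sums = s.groupby(key).sum(min_count=1)

print(key.tolist())                  # [0, 0, 1, 2, 3, 3, 3, 3, 3, 3, 4, 5, 6]
print(block_sums.dropna().tolist())  # [2.0, 5.0]
print(block_sums.mean())             # 3.5
```

Since every kept value is 1, the sum of a group equals the length of its run.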

Answer 3

This uses cumsum, shift and an XOR mask.

b = x.cumsum()  
c = b.shift(-1)
b_masked = b[b.isnull() ^ c.isnull()]

b_masked.max() / b_masked.count()

a    3.5
b    3.0
c    1.5
dtype: float64

First, compute b = x.cumsum():

    a       b       c
0   1.0     NaN     NaN
1   2.0     NaN     1.0
2   NaN     NaN     2.0
3   NaN     NaN     NaN
4   NaN     NaN     NaN
5   3.0     NaN     3.0
6   4.0     NaN     NaN
7   5.0     NaN     NaN
8   6.0     NaN     NaN
9   7.0     1.0     NaN
10  NaN     2.0     NaN
11  NaN     3.0     NaN
12  NaN     NaN     NaN

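Then shift b up by one row: c = b.shift(-1). Next, build an XOR mask with b.isnull() ^ c.isnull(). This mask keeps exactly one value per run of consecutive 1s, namely the last one. Note that it also produces an extra True on the row just before a run starts. But since we index back into b, where that position is NaN, it does not create a new element. An example illustrates this: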

 b   c   b.isnull() ^ c.isnull()    b[b.isnull() ^ c.isnull()]
NaN  1         True                          NaN
 1   2         False                         NaN
 2  NaN        True                          2
NaN NaN        False                         NaN

For the full data, b[b.isnull() ^ c.isnull()] looks like this:

    a       b        c
0   NaN     NaN     NaN
1   2.0     NaN     NaN
2   NaN     NaN     2.0
3   NaN     NaN     NaN
4   NaN     NaN     NaN
5   NaN     NaN     3.0
6   NaN     NaN     NaN
7   NaN     NaN     NaN
8   NaN     NaN     NaN
9   7.0     NaN     NaN
10  NaN     NaN     NaN
11  NaN     3.0     NaN
12  NaN     NaN     NaN

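Because we took the cumsum first, we only need the maximum and the number of non-NaN values in each column to compute the mean: the maximum is the total number of 1s, and the count is the number of runs.

Therefore, we compute b[b.isnull() ^ c.isnull()].max() / b[b.isnull() ^ c.isnull()].count()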
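For reference, a self-contained sketch of this approach. It assumes a float frame with an integer index, matching the tables above rather than the date-indexed frame in the question; the names ends and result are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data with a plain integer index (assumption).
x = pd.DataFrame(np.nan, index=range(13), columns=list("abc"))
x.iloc[0:2, 0] = 1
x.iloc[5:10, 0] = 1
x.iloc[9:12, 1] = 1
x.iloc[1:3, 2] = 1
x.iloc[5, 2] = 1

b = x.cumsum()    # running count of 1s; NaN positions stay NaN
c = b.shift(-1)   # the same counts, moved up one row

# True where exactly one of b and c is NaN, i.e. at a run boundary.
ends = b[b.isnull() ^ c.isnull()]

# max = total number of 1s, count = number of runs.
result = ends.max() / ends.count()
print(result)     # a 3.5, b 3.0, c 1.5
```

The trick relies on every non-NaN entry being exactly 1, since only then does the running total equal the number of 1s seen so far.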

Answer 4

You can use a regular expression:

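import re
import numpy as np

p = r'1+'

# Works when the 1s are stored as ints (as in the question's object-dtype
# frame), so each value prints as the single character '1'.
counts = {
    c: np.mean(
        [len(run) for run in re.findall(p, ''.join(map(str, x[c].values)))]
    )
    for c in ['a', 'b', 'c']
}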

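This works because each column here can be viewed as an expression in a language over the alphabet {1, nan}. The pattern 1+ matches every group of adjacent 1s, and re.findall returns them as a list of strings. All that remains is to compute the mean of the string lengths.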
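The string built with map(str, ...) depends on how each value prints: a float 1 becomes '1.0', which would split every run. A possible variant, sketched here with a hypothetical helper name mean_run_length_regex, encodes the column explicitly before matching:

```python
import re
import numpy as np
import pandas as pd

def mean_run_length_regex(col):
    """Mean length of runs of 1s in a Series, using a regex over a 0/1 string."""
    # '1' where the value equals 1, '0' for anything else (including NaN),
    # so the encoding does not depend on the column's dtype.
    encoded = "".join(np.where(col.eq(1), "1", "0"))
    lengths = [len(run) for run in re.findall(r"1+", encoded)]
    return np.mean(lengths) if lengths else np.nan

# First rows of column c from the question, stored as floats.
s = pd.Series([np.nan, 1.0, 1.0, np.nan, np.nan, 1.0, np.nan])
print(mean_run_length_regex(s))  # 1.5
```

Here the encoded string is "0110010", so the matches are "11" and "1".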

How to count occurrences of consecutive 1s by column and take the mean by block
