Identifying periods in a DataFrame

Time: 2015-12-14 11:58:05

Tags: python pandas

Consider the following DataFrame df:

import pandas as pd
d = {"A":[3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2], "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
df = pd.DataFrame.from_dict(d)

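I am interested in identifying, for each column, the periods during which the value equals 2. Specifically, I want to print a message indicating when (by index) the value 2 first appeared and for how long (again in terms of the index) it remained 2, ignoring single occurrences. So for the DataFrame above, the answer should look like this: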

Column A: Value 2 was observed at instance 6 and continued till instance 8.
Column B: Value 2 was observed at instance 8 and continued till instance 9.

I can do this with while and for loops, but is there a more Pythonic way? Any help is appreciated.

2 answers:

Answer 0 (score: 2)

Using numpy, one possible solution is the following (based mostly on "central directory").

import pandas as pd
d = {"A":[3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2], "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
df = pd.DataFrame.from_dict(d)

import numpy as np

def runs_of_ones_array(bits):
    # make sure all runs of ones are well-bounded
    bounded = np.hstack(([0], bits, [0]))
    # get 1 at run starts and -1 at run ends
    difs = np.diff(bounded)
    run_starts, = np.where(difs > 0)
    run_ends, = np.where(difs < 0)
    return np.vstack((run_starts, run_ends)).T

interesting_value = 2
runs = runs_of_ones_array(df["A"] == interesting_value)
for start, end in runs:
    end -= 1
    # since we don't seem to be interested in single-element runs
    if start == end:
        continue
    print("Value {} was observed at instance {} and continued till instance {}.".format(
        interesting_value, start, end))

The output of the above is:

Value 2 was observed at instance 6 and continued till instance 8.

EDIT: Modified the code to output only runs of length greater than 1.
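
The snippet above only looks at column "A". As a minimal sketch of how the same diff-based idea could cover every column and print the message format from the question, something like the following might work. The names runs_of_value and min_length are hypothetical, not part of the answer, and the reported positions are positional offsets, which match the index labels only because df has the default integer index.

```python
import numpy as np
import pandas as pd


def runs_of_value(column, value, min_length=2):
    """Return inclusive (start, end) positions of runs of `value` (hypothetical helper)."""
    bits = (column == value).to_numpy().astype(int)
    # Pad with zeros so that runs touching either edge are still bounded.
    bounded = np.hstack(([0], bits, [0]))
    difs = np.diff(bounded)
    starts, = np.where(difs > 0)  # first position of each run
    ends, = np.where(difs < 0)    # one past the last position of each run
    return [(int(s), int(e) - 1) for s, e in zip(starts, ends) if e - s >= min_length]


d = {"A": [3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2], "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
df = pd.DataFrame(d)

for name in df.columns:
    for start, end in runs_of_value(df[name], 2):
        print("Column {}: Value 2 was observed at instance {} and continued till instance {}.".format(
            name, start, end))
```

Passing min_length=1 would also report the single occurrences that the question asks to ignore.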

EDIT2: Regarding the speed of the two very similar approaches that were posted, I ran some benchmarks in IPython.

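EDIT3: If the time to generate the boolean mask is included in the benchmark, the groupby approach outperforms the others by almost an order of magnitude.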

In [28]:
%%timeit -n 10000
mask = df == 2
for col_name in mask:
    column = mask[col_name]
    runs = runs_of_ones_array(column)
    for start, end in runs:
        end -= 1
        if start == end:
            continue
        pass
10000 loops, best of 3: 452 µs per loop

In [29]:
%%timeit -n 10000
mask = df == 2
for col_name in mask:
    column = mask[col_name]
    ind = column[column].index.values
    for sub in np.split(ind, np.where(np.diff(ind) != 1)[0]+1):
        if sub.size > 1:
            pass
        pass
10000 loops, best of 3: 585 µs per loop

In [30]:
from itertools import groupby

In [31]:
%%timeit -n 10000
for k in df:
    ind = prev = 0
    for k, v in groupby(df[k], key=lambda x: x == 2):
        ind += sum(1 for _ in v)
        if k and prev + 1 != ind:
            pass
        prev = ind
10000 loops, best of 3: 73.4 µs per loop
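
The timings above rely on IPython's %%timeit magic. Outside IPython, a rough equivalent could be put together with the standard timeit module, as in the sketch below. The scan function is a hypothetical stand-in for the loop body, and it iterates over a plain list rather than a DataFrame column, so the numbers it prints are not comparable to the ones reported above.

```python
import timeit
from itertools import groupby

values = [3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2]


def scan(values, target=2):
    """Walk the runs produced by groupby and count those of `target` longer than 1."""
    found = 0
    for is_target, group in groupby(values, key=lambda x: x == target):
        if is_target and sum(1 for _ in group) > 1:
            found += 1
    return found


# repeat=3 with number=10000 mirrors "10000 loops, best of 3".
timings = timeit.repeat(lambda: scan(values), repeat=3, number=10000)
best = min(timings) / 10000
print("best of 3: {:.2f} µs per loop".format(best * 1e6))
```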

Answer 1 (score: 2)

You can use np.split:

import numpy as np
import pandas as pd
d = {"A":[3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2], "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
df = pd.DataFrame.from_dict(d)

mask = (df == 2) & (df.shift() == 2)

inds_a = mask["A"][mask["A"]].index.values
inds_b = mask["B"][mask["B"]].index.values

for ind in [inds_a, inds_b]:
    for sub in np.split(ind, np.where(np.diff(ind) != 1)[0] + 1):
        print("2 appeared at {} to {}".format(sub[0]-1, sub[-1]))

Getting the indices and filtering inside the split loop may be faster:

mask = df == 2
inds_a = mask.A[mask.A].index.values
inds_b = mask.B[mask.B].index.values


for ind in [inds_a, inds_b]:
    for sub in np.split(ind, np.where(np.diff(ind) != 1)[0] + 1):
        if sub.size > 1:
            print("2 appeared at {} to {}".format(sub[0], sub[-1]))

Output:

2 appeared at 6 to 8
2 appeared at 8 to 9
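
To make the split step less opaque, here is a small worked sketch of what np.diff, np.where and np.split do with the positions where column A equals 2. The array ind is typed in by hand for illustration, and the variable names breaks, groups and runs are my own.

```python
import numpy as np

# Positions where column A equals 2 in the example data.
ind = np.array([3, 6, 7, 8, 11])

# np.diff(ind) is [3, 1, 1, 3]; any gap other than 1 marks a break between runs.
breaks = np.where(np.diff(ind) != 1)[0] + 1  # array([1, 4])

# Splitting at the break positions yields one array per consecutive run:
# [3], [6, 7, 8], [11]
groups = np.split(ind, breaks)

# Keep only runs longer than one element and report their first and last position.
runs = [(int(g[0]), int(g[-1])) for g in groups if g.size > 1]
print(runs)  # [(6, 8)]
```

Note that if ind were empty, np.split would return a single empty array, so indexing sub[0] without the size check would fail.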

Interestingly, I found that using itertools.groupby is actually the fastest:

from itertools import groupby

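for col in df:
    ind = prev = 0
    # groupby yields consecutive runs; is_two is True for runs of 2s
    for is_two, run in groupby(df[col], key=lambda x: x == 2):
        ind += sum(1 for _ in run)  # ind is now one past the end of this run
        # report only runs of 2s that are longer than one element
        if is_two and prev + 1 != ind:
            print("2 appeared at {} to {}".format(prev, ind - 1))
        prev = ind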

Output:

2 appeared at 6 to 8
2 appeared at 8 to 9
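
If the runs are needed as data rather than printed, the groupby loop could be wrapped in a small function. This is only a sketch: runs_with_groupby and min_length are hypothetical names, and the example feeds it plain lists, although a pandas column such as df["A"] should work the same way since iterating over a Series yields its values.

```python
from itertools import groupby


def runs_with_groupby(values, target, min_length=2):
    """Return inclusive (start, end) positions of runs of `target` (hypothetical helper)."""
    runs = []
    pos = 0  # position of the first element of the current run
    for is_target, group in groupby(values, key=lambda x: x == target):
        length = sum(1 for _ in group)
        if is_target and length >= min_length:
            runs.append((pos, pos + length - 1))
        pos += length
    return runs


data = {"A": [3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2], "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
result = {name: runs_with_groupby(vals, 2) for name, vals in data.items()}
print(result)  # {'A': [(6, 8)], 'B': [(8, 9)]}
```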