Consider the following DataFrame df:
import pandas as pd
d = {"A":[3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2], "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
df = pd.DataFrame.from_dict(d)
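I am interested in identifying the periods during which each column's value equals 2. Specifically, I want to print a message indicating when (by index) the value 2 first appeared and for how long (again by index) the value stayed at 2, ignoring single occurrences. For the DataFrame above, the answer should look like this: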
Column A: Value 2 was observed at instance 6 and continued till instance 8.
Column B: Value 2 was observed at instance 9 and continued till instance 10.
I can do this with while and for loops, but is there a Pythonic way? Any help is appreciated.
Answer 0 (score: 2)
Using numpy, one possible solution is the following (largely based on "central directory").
import pandas as pd
d = {"A":[3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2], "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
df = pd.DataFrame.from_dict(d)
import numpy as np
def runs_of_ones_array(bits):
    # make sure all runs of ones are well-bounded
    bounded = np.hstack(([0], bits, [0]))
    # get 1 at run starts and -1 at run ends
    difs = np.diff(bounded)
    run_starts, = np.where(difs > 0)
    run_ends, = np.where(difs < 0)
    return np.vstack((run_starts, run_ends)).T
interesting_value = 2
runs = runs_of_ones_array(df["A"] == interesting_value)
for start, end in runs:
    end -= 1
    # since we don't seem to be interested in single-element runs
    if start == end:
        continue
    print("Value {} was observed at instance {} and continued till instance {}.".format(
        interesting_value, start, end))
The output of the above is
Value 2 was observed at instance 6 and continued till instance 8.
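To get the per-column messages the question asks for, the same function can be applied to each column in turn. This is only a minimal sketch: the helper name describe_runs and its min_length parameter are hypothetical and not part of the code above. The reported numbers are positional offsets, which match the index labels only when the DataFrame has the default integer index.

```python
import numpy as np
import pandas as pd


def runs_of_ones_array(bits):
    # pad with zeros so that every run has a well-defined start and end
    bounded = np.hstack(([0], np.asarray(bits, dtype=int), [0]))
    difs = np.diff(bounded)
    run_starts, = np.where(difs > 0)
    run_ends, = np.where(difs < 0)
    return np.vstack((run_starts, run_ends)).T


def describe_runs(df, value, min_length=2):
    """Hypothetical helper: one message per run of `value` in each column."""
    messages = []
    for col_name in df.columns:
        for start, end in runs_of_ones_array(df[col_name] == value):
            # `end` is exclusive here, so the run covers start .. end - 1
            if end - start >= min_length:
                messages.append(
                    "Column {}: Value {} was observed at instance {} "
                    "and continued till instance {}.".format(
                        col_name, value, start, end - 1))
    return messages


d = {"A": [3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2],
     "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
df = pd.DataFrame.from_dict(d)
for message in describe_runs(df, 2):
    print(message)
```

Raising min_length would skip short runs of any length, not just single occurrences.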
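Edit: modified the code to output only runs longer than 1.
EDIT2: Regarding the speed of the two very similar approaches posted, I ran some benchmarks in IPython.
EDIT3: If the time to generate the boolean mask is included in the benchmark, the groupby method outperforms the others by almost an order of magnitude.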
In [28]:
%%timeit -n 10000
mask = df == 2
for col_name in mask:
    column = mask[col_name]
    runs = runs_of_ones_array(column)
    for start, end in runs:
        end -= 1
        if start == end:
            continue
        pass
10000 loops, best of 3: 452 µs per loop
In [29]:
%%timeit -n 10000
mask = df == 2
for col_name in mask:
    column = mask[col_name]
    ind = column[column].index.values
    for sub in np.split(ind, np.where(np.diff(ind) != 1)[0]+1):
        if sub.size > 1:
            pass
        pass
10000 loops, best of 3: 585 µs per loop
In [30]:
from itertools import groupby
In [31]:
%%timeit -n 10000
for k in df:
    ind = prev = 0
    for k, v in groupby(df[k], key=lambda x: x == 2):
        ind += sum(1 for _ in v)
        if k and prev + 1 != ind:
            pass
        prev = ind
10000 loops, best of 3: 73.4 µs per loop
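For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The function names runs_diff and runs_groupby are made up for this example, and both functions collect their results instead of executing pass, so the absolute numbers will not match the ones above and will vary with the machine and the pandas version.

```python
import timeit
from itertools import groupby

import numpy as np
import pandas as pd

d = {"A": [3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2],
     "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
df = pd.DataFrame.from_dict(d)


def runs_diff(df, value=2):
    """np.diff on a zero-padded mask; returns (column, first, last) tuples."""
    found = []
    mask = df == value  # mask creation is part of the timed work
    for col_name in mask:
        bounded = np.hstack(([0], mask[col_name].to_numpy().astype(int), [0]))
        difs = np.diff(bounded)
        starts, = np.where(difs > 0)
        ends, = np.where(difs < 0)
        found.extend((col_name, int(s), int(e) - 1)
                     for s, e in zip(starts, ends) if e - s > 1)
    return found


def runs_groupby(df, value=2):
    """itertools.groupby over the raw column values."""
    found = []
    for col_name in df:
        ind = prev = 0
        for is_match, group in groupby(df[col_name], key=lambda x: x == value):
            ind += sum(1 for _ in group)
            if is_match and prev + 1 != ind:
                found.append((col_name, prev, ind - 1))
            prev = ind
    return found


repeats = 200
for func in (runs_diff, runs_groupby):
    seconds = timeit.timeit(lambda: func(df), number=repeats)
    print("{}: {:.1f} us per loop".format(func.__name__, seconds / repeats * 1e6))
```

Checking that both functions return the same runs before timing them is a cheap way to make sure the comparison is fair.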
Answer 1 (score: 2)
You can split the indices:
import numpy as np
import pandas as pd
d = {"A":[3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2], "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
df = pd.DataFrame.from_dict(d)
mask = (df == 2) & (df.shift() == 2)
inds_a = mask["A"][mask["A"]].index.values
inds_b = mask["B"][mask["B"]].index.values
for ind in [inds_a, inds_b]:
    for sub in np.split(ind, np.where(np.diff(ind) != 1)[0]+1):
        print("2 appeared at {} to {}".format(sub[0]-1, sub[-1]))
Getting the indices and filtering after the split may be faster:
mask = df == 2
inds_a = mask.A[mask.A].index.values
inds_b = mask.B[mask.B].index.values
for ind in [inds_a, inds_b]:
    for sub in np.split(ind, np.where(np.diff(ind) != 1)[0]+1):
        if sub.size > 1:
            print("2 appeared at {} to {}".format(sub[0], sub[-1]))
Output:
2 appeared at 6 to 8
2 appeared at 8 to 9
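The split idea can also be wrapped in a small function so it works for any number of columns. This is just a sketch: runs_by_split and min_length are hypothetical names, and it uses np.flatnonzero to get positions rather than index labels, which is a small change from the code above. Filtering on the size of each piece also covers a column with no matches at all, where np.split returns a single empty piece.

```python
import numpy as np
import pandas as pd


def runs_by_split(column, value, min_length=2):
    """Hypothetical helper: (first, last) positions of each run of `value`."""
    positions = np.flatnonzero(column.to_numpy() == value)
    # break the sorted positions wherever neighbours are not adjacent
    pieces = np.split(positions, np.where(np.diff(positions) != 1)[0] + 1)
    return [(int(p[0]), int(p[-1])) for p in pieces if p.size >= min_length]


d = {"A": [3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2],
     "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
df = pd.DataFrame.from_dict(d)
for name in df.columns:
    for first, last in runs_by_split(df[name], 2):
        print("Column {}: 2 appeared at {} to {}".format(name, first, last))
```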
Interestingly, I found that using itertools.groupby is actually the fastest:
from itertools import groupby
for k in df:
    ind = prev = 0
    for k, v in groupby(df[k], key=lambda x: x == 2):
        ind += sum(1 for _ in v)
        if k and prev + 1 != ind:
            print("2 appeared at {} to {}".format(prev, ind - 1))
        prev = ind
Output:
2 appeared at 6 to 8
2 appeared at 8 to 9
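If the groupby loop is needed in more than one place, it could be turned into a generator along these lines. The name iter_runs and the min_length parameter are invented for this sketch; it tracks each run's length explicitly instead of comparing prev + 1 with ind, which should be equivalent for skipping single occurrences. It needs only the standard library and accepts any iterable, including a pandas column.

```python
from itertools import groupby


def iter_runs(values, target, min_length=2):
    """Hypothetical helper: yield (first, last) positions of each run of `target`."""
    position = 0
    for is_target, group in groupby(values, key=lambda x: x == target):
        length = sum(1 for _ in group)  # consume the group to count its items
        if is_target and length >= min_length:
            yield position, position + length - 1
        position += length


d = {"A": [3, 3, 3, 2, 3, 3, 2, 2, 2, 3, 3, 2],
     "B": [3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3]}
for name, values in d.items():
    for first, last in iter_runs(values, 2):
        print("Column {}: Value 2 was observed at instance {} "
              "and continued till instance {}.".format(name, first, last))
```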