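I'm new to pandas. I'm using pandas to read a CSV file of timestamped records into a dataframe. The data has the following columns:
Timestamp COLUMN_A COLUMN_B COLUMN_C
After reading the data into the dataframe, I'd like to run a window function on COLUMN_C; the function should return timestamped values of the column.
I wrote something that works for iterables: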
import collections
import itertools


def sliding_window_iter(iterable, size):
    """Iterate through iterable using a sliding window of several elements.

    Creates an iterable where each element is a tuple of `size`
    consecutive elements from `iterable`, advancing by 1 element each
    time. For example:

    >>> list(sliding_window_iter([1, 2, 3, 4], 2))
    [(1, 2), (2, 3), (3, 4)]
    """
    iterable = iter(iterable)
    window = collections.deque(
        itertools.islice(iterable, size-1),
        maxlen=size
    )
    for item in iterable:
        window.append(item)
        yield tuple(window)
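How can I modify it to work on a column of a dataframe?
Answer 0 (score: 2)
It is simpler to slice the dataframe consecutively. Since you want overlapping windows [(1, 2), (2, 3), (3, 4), ...], you can write: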
def sliding_window_iter(series, size):
    """series is a column of a dataframe"""
    for start_row in range(len(series) - size + 1):
        # .iloc makes the slice explicitly positional, whatever the index type
        yield series.iloc[start_row:start_row + size]
Usage:
import pandas as pd

df = pd.DataFrame({'A': list(range(100, 501, 100)),
                   'B': list(range(-20, -15)),
                   'C': [0, 1, 2, None, 4]},
                  index=pd.date_range('2021-01-01', periods=5))
list(sliding_window_iter(df['C'], 2))
Output:
[2021-01-01    0.0
 2021-01-02    1.0
 Freq: D, Name: C, dtype: float64,
 2021-01-02    1.0
 2021-01-03    2.0
 Freq: D, Name: C, dtype: float64,
 2021-01-03    2.0
 2021-01-04    NaN
 Freq: D, Name: C, dtype: float64,
 2021-01-04    NaN
 2021-01-05    4.0
 Freq: D, Name: C, dtype: float64]
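Since the question asks for timestamped values back from the window function, one option is to label each result with the last timestamp of its window. The following is only a sketch under that assumption; `apply_over_windows` is a hypothetical helper name, not part of pandas, and the generator is repeated so the block runs on its own:

```python
import pandas as pd


def sliding_window_iter(series, size):
    """Yield overlapping positional slices of `size` rows."""
    for start_row in range(len(series) - size + 1):
        yield series.iloc[start_row:start_row + size]


def apply_over_windows(series, size, func):
    """Hypothetical helper: apply `func` to each window and index the
    results by the last timestamp of the window (an assumed convention)."""
    labels = []
    values = []
    for window in sliding_window_iter(series, size):
        labels.append(window.index[-1])
        values.append(func(window))
    return pd.Series(values, index=labels, name=series.name)


s = pd.Series([0, 1, 2, None, 4],
              index=pd.date_range('2021-01-01', periods=5),
              name='C')

# Sum each window of 2 rows; Series.sum() skips NaN by default.
result = apply_over_windows(s, 2, lambda w: w.sum())
```

Here `result` has one entry per window, so it is `size - 1` rows shorter than the input, and its index starts at the second timestamp.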
It also works if you pass in multiple columns:
list(sliding_window_iter(df.loc[:, ['A', 'C']], 2))
#output:
[              A    C
 2021-01-01  100  0.0
 2021-01-02  200  1.0,
               A    C
 2021-01-02  200  1.0
 2021-01-03  300  2.0,
               A    C
 2021-01-03  300  2.0
 2021-01-04  400  NaN,
               A    C
 2021-01-04  400  NaN
 2021-01-05  500  4.0]
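If the function you want to run per window is an aggregation, pandas' built-in `Series.rolling` may cover the same need without a custom generator, and its result keeps the original timestamp index. A minimal sketch, reusing the `C` column from the example above (the variable names `rolling_mean` and `rolling_range` are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'C': [0, 1, 2, None, 4]},
                  index=pd.date_range('2021-01-01', periods=5))

# Built-in aggregation over a fixed window of 2 rows.
rolling_mean = df['C'].rolling(window=2).mean()

# Custom function per window; raw=True passes each window as a NumPy array.
rolling_range = df['C'].rolling(window=2).apply(
    lambda w: w.max() - w.min(), raw=True)
```

Unlike the generator, the rolling result has the same length as the input: the first `window - 1` rows are NaN, and with the default `min_periods` any window containing a NaN also yields NaN.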