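I have a pandas DataFrame that I need to feed, in chunks of n rows, into a downstream function (print in this example). The chunks may have overlapping rows.
Let's start with a dummy DataFrame: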
import pandas as pd

d = {'A': list(range(1000)), 'B': list(range(1000))}
df = pd.DataFrame(d)
For chunks of 2 rows with 1 row of overlap, I have the following code:
a = df.index.values[:-1]
for i in a:
    print(df.iloc[i:i+2])
The output looks like this:
...
A B
996 996 996
997 997 997
A B
997 997 997
998 998 998
A B
998 998 998
999 999 999
This is exactly what I want.
Is there a better/faster way to iterate over a pandas.DataFrame in chunks of n rows?
Answer 0: (score: 3)
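Use DataFrame.groupby with a helper 1-D array, built by integer division, that has the same length as df. This gives chunks whose index values do not overlap: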
import numpy as np

d = {'A': list(range(5)), 'B': list(range(5))}
df = pd.DataFrame(d)
# Integer division assigns every 2 consecutive rows the same group label.
print(np.arange(len(df)) // 2)
[0 0 1 1 2]
for i, g in df.groupby(np.arange(len(df)) // 2):
    print(g)
A B
0 0 0
1 1 1
A B
2 2 2
3 3 3
A B
4 4 4
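As a possible alternative to the groupby approach above, the same non-overlapping chunks can be produced by slicing row positions directly with iloc, which avoids building the helper array. This is only a minimal sketch and not part of the original answer: the name iter_chunks is hypothetical, and no speed claim is made here.

```python
import pandas as pd

def iter_chunks(frame, size):
    """Yield consecutive, non-overlapping chunks of at most `size` rows.

    `iter_chunks` is a hypothetical helper name used only for illustration.
    """
    if size < 1:
        raise ValueError("size must be a positive integer")
    # Step through row positions `size` at a time; the last chunk may be shorter.
    for start in range(0, len(frame), size):
        yield frame.iloc[start:start + size]

df = pd.DataFrame({'A': list(range(5)), 'B': list(range(5))})
chunks = list(iter_chunks(df, 2))
for chunk in chunks:
    print(chunk)
```

With 5 rows and size 2 this yields three chunks, the last of which holds the single remaining row, matching the groupby output above.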
Edit:
For overlapping values, adapt this answer:
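def chunker1(seq, size):
    # Stop at len(seq) - size + 1 so every window holds exactly `size` rows
    # (for size=2 this is the same range as len(seq) - 1).
    return (seq.iloc[pos:pos + size] for pos in range(0, len(seq) - size + 1))

for i in chunker1(df, 2):
    print(i)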
A B
0 0 0
1 1 1
A B
1 1 1
2 2 2
A B
2 2 2
3 3 3
A B
3 3 3
4 4 4
Answer 1: (score: 1)
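Below is a generator version of the chunk function, with an overlap parameter that controls the step between chunks. This version also works with a custom index on a pd.DataFrame or pd.Series (for example, a float index). To make the overlap easier to check, an integer index is used here.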
import numpy as np
import pandas as pd

sz = 14
# ind = np.linspace(0., 10., num=sz)
ind = range(sz)
df = pd.DataFrame(np.random.rand(sz, 4),
                  index=ind,
                  columns=['a', 'b', 'c', 'd'])

def chunker(seq, size, overlap):
    # Consecutive chunks start size - overlap rows apart.
    for pos in range(0, len(seq), size - overlap):
        yield seq.iloc[pos:pos + size]

chunk_size = 6
chunk_overlap = 2

# Iterate over all chunks.
for i in chunker(df, chunk_size, chunk_overlap):
    print(i)

# Or pull chunks one at a time from the generator.
chnk = chunker(df, chunk_size, chunk_overlap)
print('\n', chnk, end='\n\n')
print('First "next()":', next(chnk), sep='\n', end='\n\n')
print('Second "next()":', next(chnk), sep='\n', end='\n\n')
print('Third "next()":', next(chnk), sep='\n', end='\n\n')
Output with overlap size = 2:
          a         b         c         d
0  0.577076  0.025997  0.692832  0.884328
1  0.504888  0.575851  0.514702  0.056509
2  0.880886  0.563262  0.292375  0.881445
3  0.360011  0.978203  0.799485  0.409740
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502
          a         b         c         d
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502
6  0.985677  0.110076  0.724568  0.990237
7  0.109516  0.777629  0.485162  0.275508
8  0.765256  0.226010  0.262838  0.758222
9  0.805593  0.760361  0.833966  0.024916
           a         b         c         d
8   0.765256  0.226010  0.262838  0.758222
9   0.805593  0.760361  0.833966  0.024916
10  0.418790  0.305439  0.258288  0.988622
11  0.978391  0.013574  0.427689  0.410877
12  0.943751  0.331948  0.823607  0.847441
13  0.359432  0.276289  0.980688  0.996048
           a         b         c         d
12  0.943751  0.331948  0.823607  0.847441
13  0.359432  0.276289  0.980688  0.996048

First "next()":
          a         b         c         d
0  0.577076  0.025997  0.692832  0.884328
1  0.504888  0.575851  0.514702  0.056509
2  0.880886  0.563262  0.292375  0.881445
3  0.360011  0.978203  0.799485  0.409740
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502

Second "next()":
          a         b         c         d
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502
6  0.985677  0.110076  0.724568  0.990237
7  0.109516  0.777629  0.485162  0.275508
8  0.765256  0.226010  0.262838  0.758222
9  0.805593  0.760361  0.833966  0.024916

Third "next()":
           a         b         c         d
8   0.765256  0.226010  0.262838  0.758222
9   0.805593  0.760361  0.833966  0.024916
10  0.418790  0.305439  0.258288  0.988622
11  0.978391  0.013574  0.427689  0.410877
12  0.943751  0.331948  0.823607  0.847441
13  0.359432  0.276289  0.980688  0.996048
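Two edge cases in the generator above may be worth guarding against. An overlap equal to or greater than the chunk size makes the step zero or negative, and the final chunk can be shorter than the requested size (in the output above, rows 12 and 13 appear again on their own). The following is a minimal sketch, assuming that short trailing chunks should be optional. The name chunker_checked and the drop_short parameter are hypothetical additions, not part of the original answer.

```python
import numpy as np
import pandas as pd

def chunker_checked(seq, size, overlap, drop_short=False):
    """Yield chunks of `size` rows whose starts are size - overlap rows apart.

    `chunker_checked` and `drop_short` are hypothetical names for illustration.
    """
    if size < 1:
        raise ValueError("size must be a positive integer")
    if not 0 <= overlap < size:
        # overlap >= size would give a zero or negative step.
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    for pos in range(0, len(seq), step):
        chunk = seq.iloc[pos:pos + size]
        if drop_short and len(chunk) < size:
            # Every later chunk would be short as well, so stop here.
            break
        yield chunk

df = pd.DataFrame(np.random.rand(14, 4), columns=['a', 'b', 'c', 'd'])
full = list(chunker_checked(df, 6, 2, drop_short=True))
print([list(c.index) for c in full])
```

With 14 rows, a size of 6, and an overlap of 2, drop_short=True keeps the three full chunks and omits the trailing two-row chunk. Leaving drop_short at its default reproduces the four chunks shown above.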