Pandas: iterate over multiple rows at a time, with overlap

Time: 2019-06-18 10:55:57

Tags: python pandas iteration

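I have a pandas DataFrame that I need to feed, in chunks of n rows, into a downstream function (print in this example). The chunks may have overlapping rows.

Let's start with a dummy DataFrame: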

import pandas as pd

d = {'A': list(range(1000)), 'B': list(range(1000))}
df = pd.DataFrame(d)

For 2-row chunks that overlap by 1 row, I have the following code:

a = df.index.values[:-1]
for i in a:
    print(df.iloc[i:i+2])

The output looks like this:

...
       A    B
996  996  996
997  997  997
       A    B
997  997  997
998  998  998
       A    B
998  998  998
999  999  999

This is exactly what I want.

Is there a better/faster way to iterate over a pandas.DataFrame in chunks of n rows?

2 answers:

Answer 0 (score: 3)

Use DataFrame.groupby with the integer division of a helper 1-D array that has the same length as df. The resulting chunks do not overlap:

import numpy as np
import pandas as pd
d = {'A': list(range(5)), 'B': list(range(5))}
df = pd.DataFrame(d)

print (np.arange(len(df)) // 2)
[0 0 1 1 2]

for i, g in df.groupby(np.arange(len(df)) // 2):
    print (g)

   A  B
0  0  0
1  1  1
   A  B
2  2  2
3  3  3
   A  B
4  4  4
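To make the grouping key above reusable for any chunk length, it could be wrapped in a small generator. This is only a sketch: the name groupby_chunks and its parameter n are made up for illustration and are not part of pandas. Note that the last chunk is shorter when len(df) is not a multiple of n.

```python
import numpy as np
import pandas as pd

def groupby_chunks(frame, n):
    """Yield consecutive, non-overlapping chunks of at most n rows (hypothetical helper)."""
    # Positional key: rows 0..n-1 get 0, rows n..2n-1 get 1, and so on.
    key = np.arange(len(frame)) // n
    for _, group in frame.groupby(key):
        yield group

df = pd.DataFrame({'A': range(5), 'B': range(5)})
chunks = list(groupby_chunks(df, 2))
sizes = [len(c) for c in chunks]
print(sizes)  # [2, 2, 1]
```

Because the key is built from row positions rather than index labels, this should behave the same for a DataFrame with a non-default index.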

Edit:

For overlapping values, adapt this answer:

def chunker1(seq, size):
    # Stop where the last full window of `size` rows begins.
    return (seq.iloc[pos:pos + size] for pos in range(len(seq) - size + 1))

for i in chunker1(df,2):
    print (i)

   A  B
0  0  0
1  1  1
   A  B
1  1  1
2  2  2
   A  B
2  2  2
3  3  3
   A  B
3  3  3
4  4  4
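As a possible generalization of the overlapping iteration above, the window can also advance by more than one row at a time. The sketch below assumes that only full windows are wanted; sliding_windows and its step parameter are hypothetical names chosen for this example.

```python
import pandas as pd

def sliding_windows(frame, size, step=1):
    """Yield full windows of `size` rows, advancing `step` rows each time (hypothetical helper)."""
    if size < 1 or step < 1:
        raise ValueError("size and step must be positive integers")
    # Stop at the last position where a full window still fits.
    for pos in range(0, len(frame) - size + 1, step):
        yield frame.iloc[pos:pos + size]

df = pd.DataFrame({'A': range(5), 'B': range(5)})
windows = [w['A'].tolist() for w in sliding_windows(df, size=3, step=1)]
print(windows)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

With size=2 and step=1 this reproduces the 2-row, 1-row-overlap chunks from the question; if the frame has fewer rows than size, nothing is yielded.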

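Answer 1 (score: 1)

Overlapping generator function for iterating over pandas DataFrames and Series

A chunk function with an overlap parameter to control the amount of overlap

A generator version of the chunk function is shown below; its overlap parameter controls how many rows consecutive chunks share, so each chunk starts size - overlap rows after the previous one. Because it slices by position with iloc, this version also works with a custom index on a pd.DataFrame or pd.Series (for example, a float index). An integer index is used here only because it makes the overlap easier to check.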

import numpy as np
import pandas as pd

sz = 14
# ind = np.linspace(0., 10., num=sz)  # a float index works too
ind = range(sz)

df = pd.DataFrame(np.random.rand(sz, 4),
                  index=ind,
                  columns=['a', 'b', 'c', 'd'])

def chunker(seq, size, overlap):
    # Consecutive chunks start size - overlap rows apart; requires overlap < size.
    for pos in range(0, len(seq), size - overlap):
        yield seq.iloc[pos:pos + size]

chunk_size = 6
chunk_overlap = 2
for i in chunker(df, chunk_size, chunk_overlap):
    print(i)

# The generator can also be advanced manually with next().
chnk = chunker(df, chunk_size, chunk_overlap)
print('\n', chnk, end='\n\n')
print('First "next()":', next(chnk), sep='\n', end='\n\n')
print('Second "next()":', next(chnk), sep='\n', end='\n\n')
print('Third "next()":', next(chnk), sep='\n', end='\n\n')

Output for overlap size = 2:
          a         b         c         d
0  0.577076  0.025997  0.692832  0.884328
1  0.504888  0.575851  0.514702  0.056509
2  0.880886  0.563262  0.292375  0.881445
3  0.360011  0.978203  0.799485  0.409740
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502
          a         b         c         d
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502
6  0.985677  0.110076  0.724568  0.990237
7  0.109516  0.777629  0.485162  0.275508
8  0.765256  0.226010  0.262838  0.758222
9  0.805593  0.760361  0.833966  0.024916
           a         b         c         d
8   0.765256  0.226010  0.262838  0.758222
9   0.805593  0.760361  0.833966  0.024916
10  0.418790  0.305439  0.258288  0.988622
11  0.978391  0.013574  0.427689  0.410877
12  0.943751  0.331948  0.823607  0.847441
13  0.359432  0.276289  0.980688  0.996048
           a         b         c         d
12  0.943751  0.331948  0.823607  0.847441
13  0.359432  0.276289  0.980688  0.996048

 

First "next()":
          a         b         c         d
0  0.577076  0.025997  0.692832  0.884328
1  0.504888  0.575851  0.514702  0.056509
2  0.880886  0.563262  0.292375  0.881445
3  0.360011  0.978203  0.799485  0.409740
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502

Second "next()":
          a         b         c         d
4  0.774816  0.332331  0.809632  0.675279
5  0.453223  0.621464  0.066353  0.083502
6  0.985677  0.110076  0.724568  0.990237
7  0.109516  0.777629  0.485162  0.275508
8  0.765256  0.226010  0.262838  0.758222
9  0.805593  0.760361  0.833966  0.024916

Third "next()":
           a         b         c         d
8   0.765256  0.226010  0.262838  0.758222
9   0.805593  0.760361  0.833966  0.024916
10  0.418790  0.305439  0.258288  0.988622
11  0.978391  0.013574  0.427689  0.410877
12  0.943751  0.331948  0.823607  0.847441
13  0.359432  0.276289  0.980688  0.996048
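In the output above, the final chunk of the for loop (rows 12 and 13) contains only rows that the previous chunk already covered, and chunker would raise or yield nothing if overlap were not smaller than size. If that is undesirable, one possible variant is sketched below; chunker_checked is a hypothetical name, and stopping early is an assumption about what the caller wants, not part of the answer's original function.

```python
import numpy as np
import pandas as pd

def chunker_checked(seq, size, overlap):
    """Like chunker above, but validates arguments and stops once the end is reached (hypothetical variant)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    for pos in range(0, len(seq), step):
        yield seq.iloc[pos:pos + size]
        # Once a chunk reaches the last row, any later chunk would only repeat rows.
        if pos + size >= len(seq):
            break

df = pd.DataFrame(np.arange(14 * 4).reshape(14, 4), columns=['a', 'b', 'c', 'd'])
bounds = [(int(c.index[0]), int(c.index[-1])) for c in chunker_checked(df, 6, 2)]
print(bounds)  # [(0, 5), (4, 9), (8, 13)]
```

A shorter final chunk is still yielded whenever it contains rows that have not been seen yet; only the purely repeated tail is dropped.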