Selecting values from a pandas DataFrame

Time: 2015-12-18 20:45:48

Tags: python list pandas

I am trying to run a calculation that takes several inputs (10) at a time, drawing those inputs from a long series (hundreds) of values.

I have a data frame of random values (here a pandas Series):

s = pd.Series(np.random.randint(0,1000,size=240))

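I want to take the first 10 values, put them in a list, and run the calculation. Then I want to take the next 10 values from the data frame, put them in a new list, and run the calculation again.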

How do you do this in pandas?

3 Answers:

Answer 0: (score: 2)

Alternatively, you can use a generator:

s = pd.Series(np.random.randint(0,1000,size=240))

def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

c = chunks(s.tolist(), 10)

print(next(c))
[198, 854, 363, 818, 664, 983, 110, 333, 428, 801]

print(next(c))
[711, 973, 938, 518, 765, 739, 59, 546, 377, 834]
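To connect this back to the question, here is a minimal sketch of feeding each chunk into a calculation. The function `calc` is a hypothetical placeholder for the real 10-input calculation, and a plain list stands in for `s.tolist()` so the sketch runs without pandas.

```python
def chunks(l, n):
    """Yield successive n-sized slices of l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

def calc(values):
    # Hypothetical placeholder for the real 10-input calculation.
    return sum(values) / len(values)

# Assumption: a plain list stands in for s.tolist().
data = list(range(240))

# One result per chunk of 10 values.
results = [calc(chunk) for chunk in chunks(data, 10)]
print(len(results))   # 24
print(results[:3])    # [4.5, 14.5, 24.5]
```

With a real Series, `chunks(s.tolist(), 10)` would take the place of `chunks(data, 10)`.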

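Since the discussion has turned to the interesting performance aspects, here is how the following versions, which differ in whether they take Series or list input, compare. Passing pd.Series.tolist() as input to the generator works quite well: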

import pandas as pd
import numpy as np

s = pd.Series(np.random.randint(0,1000,size=200000))

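def chunks_gen_tolist(s):
    # Convert the Series to a list and pass it straight to the generator.
    c = chunks(s.tolist(), 10)
    for row in c:
        pass  # visit every chunk; no work is done per chunk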

%timeit chunks_gen_tolist(s)
100 loops, best of 3: 14.2 ms per loop

There is not much difference compared with using a list as input:

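def chunks_gen_l(s):
    # Bind the list to a name first, then hand it to the generator.
    l = s.tolist()
    c = chunks(l, 10)
    for row in c:
        pass  # visit every chunk; no work is done per chunk

%timeit chunks_gen_l(s)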
100 loops, best of 3: 14.1 ms per loop

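Regarding @Padraic Cunningham's comment: my understanding is that the memory for s is allocated once, when it is first created, whereas the chunks function returns a generator that produces the slices one at a time, on each call to next().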
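The lazy behaviour described here can be checked with a small sketch. It assumes a plain list in place of `s.tolist()`; the names `data`, `first`, and `second` are illustrative only.

```python
import inspect

def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

data = list(range(30))   # assumption: stand-in for s.tolist()
gen = chunks(data, 10)

# Calling chunks() runs none of its body; it only returns a generator object.
print(inspect.isgenerator(gen))   # True

first = next(gen)    # the first slice is built only at this point
second = next(gen)   # the second slice is built only at this point

# Each slice is a new, independent list; the source list is left untouched.
first[0] = -1
print(data[0])       # 0
print(second)        # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
```

So the full list exists once in memory, and only the chunk currently in use is allocated on top of it.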

The version based on itertools.islice performs slightly worse:

from itertools import islice
def n_sli(s, n):
    it = iter(s)
    for sli in iter(lambda: list(islice(it, n)), []):
        yield sli

def sli(s):
    for chunk in n_sli(s, 10):
        pass

%timeit sli(s)
10 loops, best of 3: 21.8 ms per loop

Depending on your purpose, you now have several viable options.

Answer 1: (score: 2)

If you want to extract the values lazily, without first building the full list that tolist() would create:

from itertools import islice

s = pd.Series(np.random.randint(0, 1000, size=240))

def n_sli(s, n):
    it = iter(s)
    for sli in iter(lambda: list(islice(it, n)), []):
        yield sli

for sli in n_sli(s, 10):
    print(sli)

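You can see that this function performs just as well as reading all the data up front, as suggested in the other answers, without ever holding more than n values in memory: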

In [30]: s = pd.Series(np.random.randint(0,1000,size=200000))

In [31]: %%timeit
for r in n_sli(s, 1000):
    pass
   ....: 
100 loops, best of 3: 8.82 ms per loop

In [32]: %%timeit
for r in chunks(s, 1000):
    pass
   ....: 
100 loops, best of 3: 8.85 ms per loop
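The n_sli generator relies on the two-argument form of the built-in iter(), which calls a function repeatedly until it returns a sentinel value. A minimal sketch of that mechanism, assuming a plain range as input rather than a Series:

```python
from itertools import islice

def n_sli(iterable, n):
    it = iter(iterable)
    # iter(callable, sentinel) calls the lambda repeatedly and stops
    # as soon as it returns the sentinel, here an empty list.
    for piece in iter(lambda: list(islice(it, n)), []):
        yield piece

pieces = list(n_sli(range(25), 10))
print([len(p) for p in pieces])   # [10, 10, 5]
print(pieces[-1])                 # [20, 21, 22, 23, 24]
```

Because islice simply returns fewer items once the iterator runs dry, a final short chunk comes out as is and no padding is needed.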

Answer 2: (score: 0)

IIUC, you can get your chunks with a for loop and the tolist method (if you really need lists rather than slices of the pandas Series):

chunks = [s.tolist()[i:i+10] for i in range(0, s.size, 10)]

In [187]: chunks
Out[187]: 
[[555, 262, 516, 482, 940, 851, 889, 896, 597, 240],
 [530, 300, 464, 908, 565, 219, 421, 399, 64, 433],
 [488, 998, 422, 872, 612, 223, 726, 979, 886, 955],
 [164, 534, 61, 918, 225, 851, 290, 170, 815, 415],
 [755, 187, 695, 479, 836, 848, 647, 568, 135, 808],
 [442, 284, 228, 183, 506, 813, 316, 141, 267, 374],
 [640, 63, 875, 191, 98, 164, 678, 399, 164, 177],
 [725, 960, 403, 929, 597, 20, 773, 890, 677, 992],
 [658, 267, 754, 945, 506, 314, 803, 738, 583, 260],
 [153, 74, 821, 386, 451, 520, 490, 180, 602, 609],
 [473, 515, 957, 775, 138, 721, 454, 867, 990, 202],
 [934, 186, 754, 238, 486, 43, 16, 623, 338, 734],
 [825, 334, 430, 490, 571, 676, 164, 202, 391, 992],
 [909, 965, 192, 905, 792, 805, 39, 77, 600, 260],
 [577, 313, 127, 145, 250, 248, 756, 374, 56, 418],
 [595, 616, 94, 215, 758, 675, 131, 616, 501, 650],
 [327, 604, 731, 67, 543, 439, 378, 137, 79, 516],
 [615, 982, 721, 77, 851, 839, 971, 539, 535, 433],
 [631, 948, 597, 178, 686, 448, 197, 853, 713, 98],
 [206, 661, 83, 472, 694, 659, 809, 99, 916, 390],
 [957, 200, 856, 626, 588, 549, 288, 830, 257, 389],
 [793, 475, 757, 638, 469, 186, 103, 239, 734, 896],
 [988, 676, 993, 301, 785, 584, 8, 310, 388, 833],
 [42, 319, 62, 333, 115, 275, 431, 127, 420, 610]]

In [189]: chunks[0]
Out[189]: [555, 262, 516, 482, 940, 851, 889, 896, 597, 240]

EDIT

You would be better off using @Stefan's answer, because it is faster. Interestingly, though, s.iloc[i:i+10].tolist() is slower than s.tolist()[i:i+10]. Some benchmarks:

def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

def stefan(s):
    c = chunks(s.tolist(), 10)
    for row in c:
        pass

In [286]: %timeit stefan(s)
10000 loops, best of 3: 31.3 µs per loop

In [287]: %timeit [s.tolist()[i:i+10] for i in range(0, s.size, 10)]
1000 loops, best of 3: 562 µs per loop

In [288]: %timeit [s.iloc[i:i+10].tolist() for i in range(0, s.size, 10)]
1000 loops, best of 3: 1.73 ms per loop

EDIT2

As @PadraicCunningham pointed out in the comments, it is better to assign s.tolist() to a list first and then use the for loop:

In [12]: %timeit [s.tolist()[i:i+10] for i in range(0, s.size, 10)]
1000 loops, best of 3: 415 µs per loop

In [14]: %%timeit
    s_list = s.tolist()
    [s_list[i:i+10] for i in range(0, len(s_list), 10)]
    ....: 
10000 loops, best of 3: 22.8 µs per loop
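One way to see where the difference comes from is to count how often the conversion runs. The sketch below uses `CountingSeries`, a hypothetical stand-in class (not part of pandas) that mimics only `tolist()` and `size`, so it runs without pandas installed.

```python
class CountingSeries:
    """Hypothetical stand-in for a pandas Series that counts tolist() calls."""

    def __init__(self, values):
        self._values = list(values)
        self.size = len(self._values)
        self.calls = 0

    def tolist(self):
        self.calls += 1
        return list(self._values)   # builds a full copy on every call

s = CountingSeries(range(240))

# Conversion inside the comprehension: repeated for every chunk.
slow = [s.tolist()[i:i + 10] for i in range(0, s.size, 10)]
slow_calls = s.calls
print(slow_calls)    # 24

# Conversion hoisted out of the loop: done once, then only sliced.
s.calls = 0
s_list = s.tolist()
fast = [s_list[i:i + 10] for i in range(0, len(s_list), 10)]
fast_calls = s.calls
print(fast_calls)    # 1

print(slow == fast)  # True
```

Both forms produce the same chunks; the first repeats the full conversion once per chunk, which is consistent with the gap in the timings above.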