How do I select data from a dask dataframe by a list of indices?

Asked: 2016-07-12 00:19:54

Tags: python indexing dask

Let's say I have the following dask dataframe.

import pandas as pd
import dask.dataframe

dict_ = {'A':[1,2,3,4,5,6,7], 'B':[2,3,4,5,6,7,8], 'index':['x1', 'a2', 'x3', 'c4', 'x5', 'y6', 'x7']}
pdf = pd.DataFrame(dict_)
pdf = pdf.set_index('index')
ddf = dask.dataframe.from_pandas(pdf, npartitions = 2)

In addition, I have a list of indices that I am interested in, e.g.

indices_i_want_to_select = ['x1','x3', 'y6']

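How can I generate a new dask dataframe that contains only the rows specified by these indices? And is there a reason why something like ddf[ddf.A >= 4] is possible, whereas ddf[ddf.index in indices_i_want_to_select] or ddf.loc[indices_i_want_to_select] is not?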

2 Answers:

Answer 0 (score: 5)

The following seems to work:

import pandas as pd
import dask.dataframe as dd

#generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', 4, 5])
ddf = dd.from_pandas(pdf, npartitions = 2)

#list of indices I want to select
l = ['i1', 4, 5]

#generate new dask dataframe containing only the specified indices
#(meta is left out so that dask infers the output structure from the input)
ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)])

Edit: this only applies if the order of the result does not matter.
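
To illustrate the ordering caveat, here is a minimal pandas-only sketch. It assumes that each dask partition behaves like an ordinary pandas DataFrame and that the filtered partitions are concatenated in partition order. The two chunks in `partitions` are a hypothetical stand-in for what dask would build, and the names `wanted`, `filtered` and `reordered` are made up for this example.

```python
import pandas as pd

pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5], B=[6, 7, 8, 9, 0]),
                   index=['i1', 'i2', 'i3', 'i4', 'i5'])

# Requested labels, deliberately not in index order.
wanted = ['i5', 'i1', 'i4']

# Stand-in for dask partitions: two consecutive pandas chunks.
# (Assumption for illustration; real partition boundaries are chosen by dask.)
partitions = [pdf.iloc[:3], pdf.iloc[3:]]

# Apply the same per-partition filter as in the answer,
# then concatenate the pieces in partition order.
filtered = pd.concat([part[part.index.isin(wanted)] for part in partitions])

# The rows follow the order of the index, not the order of `wanted`.
# Reordering is done on the (small) in-memory result.
reordered = filtered.loc[wanted]
```

With dask, the analogous step would probably be `ddf_selected.compute().loc[l]`. That assumes the selection fits in memory and that every label in `l` exists, since pandas `.loc` raises a `KeyError` for missing labels.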

Answer 1 (score: 1)

Using dask version '1.2.0' results in an error because of the mixed index types. In any case, you can use loc:

import pandas as pd
import dask.dataframe as dd

#generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', '4', '5'])
ddf = dd.from_pandas(pdf, npartitions = 2)

#list of indices I want to select
l = ['i1', '4', '5']

#generate new dask dataframe containing only the specified indices
# ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta = ddf.dtypes)
ddf_selected = ddf.loc[l]
ddf_selected.head()
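
The answer does not say which error the mixed index triggers, so the following is only a sketch of one plausible reason, under the assumption that the index labels have to be ordered at some point. In Python 3, strings and integers cannot be compared, so a mixed index has no sort order. Converting every label to a string, as this answer does by hand, avoids that. The names `mixed_sortable`, `pdf_str` and `sorted_labels` are made up for this example.

```python
import pandas as pd

# Index from the first answer: strings mixed with integers.
pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5], B=[6, 7, 8, 9, 0]),
                   index=['i1', 'i2', 'i3', 4, 5])

# Python 3 cannot order str against int, so sorting the labels fails.
try:
    sorted(pdf.index)
    mixed_sortable = True
except TypeError:
    mixed_sortable = False

# Convert every label to a string so that all labels are comparable.
pdf_str = pdf.copy()
pdf_str.index = pdf_str.index.map(str)
sorted_labels = sorted(pdf_str.index)
```

After the conversion, the labels to select have to be strings as well, e.g. `['i1', '4', '5']` rather than `['i1', 4, 5]`.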