Question

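I have two questions about dask. First: the dask documentation clearly states that you can rename columns with the same syntax as pandas. I am using dask 1.0.0. Why am I getting the errors below?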

df = pd.DataFrame(dictionary)
df

# I am not sure how to choose values for divisions, meta, and name. I am also pretty unsure about what these really do.
ddf = dd.DataFrame(dictionary, divisions=[8], meta=pd.DataFrame(dictionary), name='ddf')    
ddf

cols = {'Key':'key', '0':'Datetime','1':'col1','2':'col2','3':'col3','4':'col4','5':'col5'}

ddf.rename(columns=cols, inplace=True)

TypeError: rename() got an unexpected keyword argument 'inplace'

OK, so I removed inplace=True and tried this:

ddf = ddf.rename(columns=cols)

ValueError: dictionary update sequence element #0 has length 6; 2 is required

The pandas DataFrame shows the real data, but when I call ddf.compute() I get an empty dataframe.

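My second question: I am somewhat confused about how to assign divisions, meta, and name. How are these useful or harmful if I am using dask to parallelize on a single machine versus a cluster?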

Answer 1

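Regarding renaming, this is how I usually change column names when working with dask; maybe it will work for you too:

[code snippet missing in the source]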
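Since the answer's own snippet is not available, here is a minimal sketch of one common renaming pattern. It is not necessarily what the answer's author used. The sample data and the names `pdf` and `new_names` are assumptions made up for illustration.

import pandas as pd
import dask.dataframe as dd

# Hypothetical sample data; the column labels 0, 1, 2 are integers.
pdf = pd.DataFrame({0: [1, 2], 1: [3, 4], 2: [5, 6]})
ddf = dd.from_pandas(pdf, npartitions=1)

# Pair each existing column label with its replacement, then rename.
# rename returns a new dask dataframe, so the result is assigned back.
new_names = ["Datetime", "col1", "col2"]
ddf = ddf.rename(columns=dict(zip(ddf.columns, new_names)))

print(list(ddf.columns))

Building the mapping from ddf.columns avoids guessing whether the existing labels are integers or strings.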

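As for determining the number of partitions, the documentation has a good example that uses time-series data to decide how to partition the dataframe: http://docs.dask.org/en/latest/dataframe-design.html#partitions.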
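As a rough illustration of partitions and divisions, the sketch below lets dd.from_pandas choose the divisions instead of passing divisions, meta, and name to the dd.DataFrame constructor by hand. The time-series data and the column name `value` are made up for this example.

import pandas as pd
import dask.dataframe as dd

# Hypothetical time series: 8 daily rows with a sorted DatetimeIndex.
index = pd.date_range("2012-06-13", periods=8, freq="D")
pdf = pd.DataFrame({"value": range(8)}, index=index)

# from_pandas splits the sorted index into partitions and records
# the index boundaries (the divisions) by itself.
ddf = dd.from_pandas(pdf, npartitions=2)

print(ddf.npartitions)
print(ddf.divisions)

The divisions tuple holds one more entry than there are partitions: the lowest index value of each partition plus the highest index value of the last one.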

Answer 2

I could not get this line to work (because I was passing dictionary as a plain Python dict, which is not the right input):

ddf = dd.DataFrame(dictionary, divisions=[2], meta=pd.DataFrame(dictionary,
                                              index=list(range(2))), name='ddf')

print(ddf.compute())
() # this is the output of ddf.compute(); clearly something is not right

So I had to create some dummy data and use it in my approach to creating a dask dataframe.

Generate dummy data in a dictionary

d = {0: [388]*2,
 1: [387]*2,
 2: [386]*2,
 3: [385]*2,
 5: [384]*2,
 '2012-06-13': [389]*2,
 '2012-06-14': [389]*2,}

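Create a Dask dataframe from the dictionary via a dask bag

This means you first have to convert the dictionary to a pandas DataFrame, then call .to_dict(..., orient='records') to get the sequence (a list of row-wise dictionaries) needed to create the dask bag.

So, this is how I created the required sequence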

d = pd.DataFrame(d, index=list(range(2))).to_dict('records')

print(d)
[{0: 388,
  1: 387,
  2: 386,
  3: 385,
  5: 384,
  '2012-06-13': 389,
  '2012-06-14': 389},
 {0: 388,
  1: 387,
  2: 386,
  3: 385,
  5: 384,
  '2012-06-13': 389,
  '2012-06-14': 389}]

Now I use the list of dictionaries to create a dask bag

dask_bag = db.from_sequence(d, npartitions=2)

print(dask_bag)
dask.bag<from_se..., npartitions=2>

Convert the dask bag to a dask dataframe

df = dask_bag.to_dataframe()

Rename the columns in the dataframe

cols = {0:'Datetime',1:'col1',2:'col2',3:'col3',5:'col5'}
df = df.rename(columns=cols)

print(df)
Dask DataFrame Structure:
              Datetime   col1   col2   col3   col5 2012-06-13 2012-06-14
npartitions=2                                                           
                 int64  int64  int64  int64  int64      int64      int64
                   ...    ...    ...    ...    ...        ...        ...
                   ...    ...    ...    ...    ...        ...        ...
Dask Name: rename, 6 tasks

Compute the dask dataframe (this time the output is not ()!)

print(df.compute())
   Datetime  col1  col2  col3  col5  2012-06-13  2012-06-14
0       388   387   386   385   384         389         389
0       388   387   386   385   384         389         389
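The steps above can be sketched end to end as follows. This is a condensed version under stated assumptions: the `records` list is made-up data with only integer keys, and the variable names are illustrative rather than taken from the answer.

import dask.bag as db

# Hypothetical records: one dict per row; the integer keys become column labels.
records = [
    {0: 388, 1: 387, 2: 386},
    {0: 388, 1: 387, 2: 386},
]

# Build a bag from the row dicts, then convert it to a dask dataframe.
bag = db.from_sequence(records, npartitions=2)
ddf = bag.to_dataframe()

# Rename with integer keys that match the integer column labels.
ddf = ddf.rename(columns={0: "Datetime", 1: "col1", 2: "col2"})

# compute() materializes the result as a pandas DataFrame.
result = ddf.compute()
print(result)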

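Notes:

Also from the .rename docs: inplace is not supported.
I think your rename dictionary used the strings '0', '1', etc. for column names that are actually integers. For your data (as with the dummy data here), the keys may simply need to be the integers 0, 1, etc.
Following the dask docs, I used a one-to-one rename dictionary; column names not included in the rename dictionary are left unchanged
- this means you do not need to pass the column names that do not need renaming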
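A small sketch of the last two notes, using made-up data: keys whose type does not match the column labels rename nothing, and labels missing from the mapping keep their names. Whether this is what caused the asker's errors is an assumption and is not confirmed here.

import pandas as pd
import dask.dataframe as dd

# Hypothetical data with integer column labels 0, 1, 2.
pdf = pd.DataFrame({0: [388, 388], 1: [387, 387], 2: [386, 386]})
ddf = dd.from_pandas(pdf, npartitions=1)

# String keys do not match the integer labels, so nothing is renamed.
unchanged = ddf.rename(columns={"0": "Datetime", "1": "col1"})

# Integer keys match; label 2 is not in the mapping and keeps its name.
renamed = ddf.rename(columns={0: "Datetime", 1: "col1"})

print(list(unchanged.columns))
print(list(renamed.columns))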

Answer 3

If you only want to lowercase the column names and replace spaces with underscores, you can do the following:

data = dd.read_csv('*.csv').rename(columns=lambda x: x.lower().replace(' ', '_'))
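To try the same callable-based rename without CSV files on disk, here is a minimal sketch on a small in-memory frame. The column names `First Name` and `Zip Code` are invented for illustration.

import pandas as pd
import dask.dataframe as dd

# Hypothetical frame whose column names contain capitals and spaces.
pdf = pd.DataFrame({"First Name": [1, 2], "Zip Code": [3, 4]})
ddf = dd.from_pandas(pdf, npartitions=1)

# The callable is applied to every column name.
ddf = ddf.rename(columns=lambda x: x.lower().replace(" ", "_"))

print(list(ddf.columns))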
