Question

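I have a CSV with 17,850,209 rows, which is too large for Pandas to handle across my whole code, so I am trying to use Dask to operate on it. All my code "works", but when I write a CSV to disk I do not get all 17,850,209 records. Instead, I get N CSVs (where N = npartitions), each with only 50,000 records, for a total of 900,000 records.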

First, I read in the raw CSV and create the clean dataframe from the first 2 columns plus a timestamp:

import pandas as pd 
import numpy as np
import time as t 
import dask.dataframe as dd


my_dtypes = {
    'uid': object, 
    'state': object, 
    'var01': np.float64, 
    'var02': np.float64
    }

df_raw = pd.read_csv('/Users/me/input_data/stackoverflow_raw.csv', dtype = my_dtypes, sep=',') 

df_clean = pd.DataFrame(df_raw['uid'].str.strip().str.replace('{','').str.replace('}',''))

df_clean['state'] = pd.DataFrame(df_raw['state'].str.strip())

df_clean['rowcreatetimestamp'] = t.strftime("%Y-%m-%d %H:%M:%S")

This gives me the following (correct) counts:

df_clean.count()
# uid                   17850209
# state                 17850209
# rowcreatetimestamp    17850209
# dtype: int64

Then I move it to Dask with a chunksize of 1,000,000 (which most of my team's machines can handle).

df_clean = dd.from_pandas(df_clean, chunksize=1000000) 

df_clean
# dd.DataFrame<from_pa..., npartitions=18, divisions=(0, 1000000, 2000000, ..., 17000000, 17850208)>

df_clean.compute()
# [17850209 rows x 3 columns]

df_clean.count().compute()
# uid                   17850209
# state                 17850209
# rowcreatetimestamp    17850209
# dtype: int64

However, when I do my first Dask operation, it only "keeps" 900,000 rows of the dataframe and creates the new column with just 50,000 rows:

df_clean['var01'] = dd.from_array(np.where((df_raw['var01'] > 0), 1, 0))

df_clean.compute()
# [900000 rows x 4 columns]

df_clean.count().compute()
# uid                   900000
# state                 900000
# rowcreatetimestamp    900000
# var01                  50000
# dtype: int64

When I write the Dask dataframe to disk, I get 18 CSVs with 50,000 records each. I have tried both passing compute=True and omitting it, and I get the same result:

df_clean.to_csv('/Users/me/input_data/stackoverflow_clean_*.csv', header=True, sep=',', index=False, compute=True)

df_clean.to_csv('/Users/me/input_data/stackoverflow_clean_*.csv', header=True, sep=',', index=False)

When I write to a single file, I get 900,000 records plus the header:

df_clean.compute().to_csv('/Users/me/input_data/stackoverflow_clean_one_file.csv', header=True, sep=',', index=False)

(in bash)

wc -l '/Users/me/input_data/stackoverflow_clean_one_file.csv' 
900001

900,000 records is already the wrong count, and on top of that, when I open the CSV only the first 50,000 rows have data for var01.

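I have searched the latest documentation, but I have not found what I am missing in terms of writing either chunk files that contain all the data or a single file with the correct number of rows.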

TIA。

Answer 1

This line is a bit odd:

df_clean['var01'] = dd.from_array(np.where((df_raw['var01'] > 0), 1, 0))

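You are mixing dask.dataframe, dask.array, and numpy together. Even if this behavior is supported (which is uncertain), mixing lazy and concrete operations like this is likely to be very slow.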
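The counts in the question are at least consistent with a chunking mismatch. dd.from_array splits its input into 50,000-row chunks by default, and 18 partitions of 50,000 rows each give exactly the 900,000 rows that were written. The arithmetic below is only a sanity check of that reading, assuming the default chunksize applied here; it does not reproduce what Dask does internally.

```python
import math

total_rows = 17_850_209        # rows in the original CSV (from the question)
frame_chunksize = 1_000_000    # chunksize passed to dd.from_pandas
array_chunksize = 50_000       # default chunksize of dd.from_array (assumed to apply here)

# Number of partitions produced by splitting the frame into 1,000,000-row chunks
npartitions = math.ceil(total_rows / frame_chunksize)

# Rows that survive if every partition ends up with only one 50,000-row chunk
rows_kept = npartitions * array_chunksize

print(npartitions)  # 18
print(rows_kept)    # 900000
```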

Instead, I recommend using dd.Series.where:

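# Convert var01 to a Dask series with the same chunksize so its partitions line up with df_clean
var01 = dd.from_pandas(df_raw['var01'], chunksize=1000000)
# Keep values that are <= 0 and set the rest to 1, then set everything that is not > 0 to 0
df_clean['var01'] = var01.where(var01 <= 0, 1).where(var01 > 0, 0)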
