How to map a function in Dask

Time: 2020-06-22 23:23:15

Tags: python pandas dataframe dask

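I am using Dask to process a dataframe (loaded from a CSV file), and I am looking for a way to improve this code with something like map or apply, because it takes far too long on large files (I know that nested for loops combined with iterrows() are about the worst idea I could have come up with).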

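import pandas as pd

NAN_VALUES = [-999, "INVALID", -9999]
_all_rows = list()
rows_count = 0  # running count of processed rows
for index, row in df.iterrows():
    _row = list()
    # items() yields (column name, value) pairs for the current row
    for key, value in row.items():
        if value in NAN_VALUES or pd.isnull(value):
            _row.append(None)
        else:
            # apply_transform (defined elsewhere) converts the value for this column
            _row.append(apply_transform(key, value))
    _all_rows.append(_row)
    rows_count += 1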

How can I map this code using map_partitions or pandas.map?

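Extra, for more context: to be able to apply certain functions, I replace NaN values with default values. At the end, I need to build one list per row, with the default values replaced by None.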

1.- Original DF

 "name"    "age"    "money"
---------------------------
"David"     NaN      12.345 
"John"      22        NaN    
"Charles"   30       123.45 
  NaN       NaN       NaN    

2.- Replace NaN with default values

 "name"       "age"    "money"
------------------------------
"David"       -999     12.345 
"John"         22      -9999  
"Charles"      30      123.45 
"INVALID"     -999     -9999  

3.- Convert each row to a list

"name"  , "age", "money"
------------------------
["David", None, 12.345]
["John", 22, None]
["Charles", 30, 123.45]
[None, None, None]

1 Answer:

Answer 0 (score: 1)

My suggestion is to get it working in pandas first and then translate it to Dask.

pandas

import pandas as pd
import numpy as np

nan = np.nan

df = {'name': {0: 'David', 1: 'John', 2: 'Charles', 3: nan},
 'age': {0: nan, 1: 22.0, 2: 30.0, 3: nan},
 'money': {0: 12.345, 1: nan, 2: 123.45, 3: nan}}

df = pd.DataFrame(df)

# These are your default values
diz = {"age": -999, "name": "INVALID", "money": -9999}

Replace NaN with default values

for k,v in diz.items():
    df[k] = df[k].fillna(v)
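
As a possible shortcut for the loop above, pandas' DataFrame.fillna also accepts a dict that maps column names to fill values. The following is a minimal sketch on the same sample data; the variable name filled is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["David", "John", "Charles", np.nan],
    "age": [np.nan, 22.0, 30.0, np.nan],
    "money": [12.345, np.nan, 123.45, np.nan],
})

# Per-column default values (same mapping as `diz`).
diz = {"age": -999, "name": "INVALID", "money": -9999}

# One call fills every column with its own default value.
filled = df.fillna(value=diz)
print(filled)
```

This returns a new frame and leaves df untouched, so it should be equivalent to the loop as long as every key in diz is a column of df.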

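Get a list for each row (the output below still shows nan, so it was evidently produced before the fillna step above was run)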

df.apply(list, axis=1)
0       [David, nan, 12.345]
1          [John, 22.0, nan]
2    [Charles, 30.0, 123.45]
3            [nan, nan, nan]
dtype: object
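
The question also asks for the default values to come back as None in each row list. One way this could be done without iterrows() is to build a boolean mask and apply it to an object-dtype copy of the data. This is only a sketch under assumptions: rows_to_lists is a hypothetical helper name, NAN_VALUES is taken from the question, and the question's apply_transform step is left out (it could be applied column by column beforehand).

```python
import pandas as pd

# Sentinel values from the question that should end up as None.
NAN_VALUES = [-999, "INVALID", -9999]


def rows_to_lists(frame: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: one list per row, with sentinels and NaN as None."""
    # Boolean mask of cells that are missing or hold a sentinel value.
    bad = (frame.isna() | frame.isin(NAN_VALUES)).to_numpy(dtype=bool)
    # Object-dtype copy so that None can be stored in every column.
    values = frame.to_numpy(dtype=object, copy=True)
    values[bad] = None
    return pd.Series(values.tolist(), index=frame.index)


# The frame after the default values have been filled in (step 2 of the question).
df = pd.DataFrame({
    "name": ["David", "John", "Charles", "INVALID"],
    "age": [-999, 22.0, 30.0, -999],
    "money": [12.345, -9999, 123.45, -9999],
})

result = rows_to_lists(df)
print(result.tolist())
```

Note that a legitimate value equal to one of the sentinels (for example a real money value of -999) would also become None, exactly as in the original nested loop.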

dask

import pandas as pd
import dask.dataframe as dd
import numpy as np

nan = np.nan

df = {'name': {0: 'David', 1: 'John', 2: 'Charles', 3: nan},
 'age': {0: nan, 1: 22.0, 2: 30.0, 3: nan},
 'money': {0: 12.345, 1: nan, 2: 123.45, 3: nan}}

df = pd.DataFrame(df)

# These are your default values
diz = {"age": -999, "name": "INVALID", "money": -9999}

# transform to dask dataframe
df = dd.from_pandas(df, npartitions=2)

Replace NaN with default values

This is exactly the same as before. Note that because Dask is lazy, you should run df.compute() if you want to see the effect.

for k,v in diz.items():
    df[k] = df[k].fillna(v)

Get a list for each row

Things change slightly here, because Dask asks you to declare the dtype of the output explicitly through the meta argument

df.apply(list, axis=1, meta=(None, 'object'))

Finally, you can use map_partitions as follows

df.map_partitions(lambda x: x.apply(list, axis=1))

Note: if your data fits in memory, you don't need Dask, and pandas will probably be faster.