Question

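This question is a follow-up to "apply lambda function to a dask dataframe". The solution should not require materializing a pandas dataframe. The reason is that my dataframe is larger than memory, so I cannot load it into memory the way pandas would. (If the data fits in memory, pandas works really well.)

The solution from the linked question is as follows.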

import pandas as pd
import dask.dataframe as dd

# How can data like this be read directly into a dask dataframe?
df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})

ddf = dd.from_pandas(df, npartitions=2)  # dask conversion
list1 = ['A', 'B', 'C']  # list of header names

for c in list1:
    vc = ddf[c].value_counts().compute()
    vc /= vc.sum()
    print(vc)  # a table with the proportion of each unique value
    for i in range(vc.count()):
        if vc.iloc[i] < 0.5:  # the value makes up less than 50% of the column
            # change that value to 'others' (iterates through all columns in list1)
            ddf[c] = ddf[c].where(ddf[c] != vc.index[i], 'others')
    print(ddf.compute())  # shows how the changes are applied column by column

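However, the second for loop takes a very long time to compute on the actual (larger-than-memory) dataframe. Is there a more efficient way to get the same output using dask?

The purpose of the code is to change column values whose labels occur less than 50% of the time to others. For example, if the value ant makes up less than 50% of a column, it is renamed to others.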
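To make the rule concrete, it can be sketched in plain pandas on the small sample data. This is only an illustration for data that fits in memory, not a solution for the larger-than-memory case; the helper name `lump_rare` and its `threshold` and `fill` parameters are made up for this sketch.

```python
import pandas as pd


def lump_rare(series, threshold=0.5, fill="others"):
    """Replace values whose share of the column is below `threshold`.

    Hypothetical helper, used here only to illustrate the rule.
    """
    # Share of each distinct value (missing values are ignored).
    freq = series.value_counts(normalize=True)
    # Values that are frequent enough to keep unchanged.
    keep = freq[freq >= threshold].index
    # Keep frequent values, replace everything else with `fill`.
    return series.where(series.isin(keep), fill)


df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})

# DataFrame.apply calls the helper once per column.
result = df.apply(lump_rare)
print(result)
```

In column A only ant reaches 50%, in column B only cat does, and in column C no value does, so every entry of C becomes others.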

Can anyone help me with this?

Thanks

Michael

Answer 1

Here is a way to skip the nested loop:

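import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', 'bee', 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', 'roo', 'emu', 'emu']})

ddf = dd.from_pandas(df, npartitions=2)

n_rows = len(ddf)  # total number of rows, computed once

for col in ddf.columns:
    # share of each distinct value in this column
    vc = ddf[col].value_counts() / n_rows
    # values with a share of at least 50% are kept (matches the question's `< 0.5` rule)
    keep = vc[vc >= .5].index.compute()
    # everything else is replaced in a single vectorized step
    ddf[col] = ddf[col].where(ddf[col].isin(list(keep)), "others")

# fine for this small demo; for larger-than-memory data, write the result out instead
ddf = ddf.compute()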

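If you have a very large dataframe stored in Parquet format, you can try reading it one column at a time and saving each processed column to a separate file. At the end, you can concatenate them horizontally.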

Change column variable values based on a condition in a dask dataframe
