Applying a filtering function to chunks of data with dask

Time: 2020-01-02 17:51:54

Tags: python dask

I wrote a function that downsamples data with pandas, but some of my datasets do not fit in memory, so I want to try dask. This is the code I am using now:

def sample_df(df, target_column="target", positive_percentage=35, index_col="index"):
    """
    Takes a data frame with imbalanced records, e.g. x% positive cases, and
    returns a data frame with the specified positive percentage, e.g. 10%.
    This is accomplished by downsampling the majority class.
    """
    positive_cases = df[df[target_column] == 1][index_col]
    number_of_samples = int(((100 / positive_percentage) - 1) * len(positive_cases))
    negative_cases = list(set(df[index_col]) - set(positive_cases))

    try:
        negative_sample = random.sample(negative_cases, number_of_samples)
    except ValueError:
        print("The requested percentage is not valid for this dataset")
        return pd.DataFrame()

    final_sample = list(negative_sample) + list(positive_cases)
    df = df[df[index_col].isin(final_sample)]

    print("New percentage is:", df[target_column].sum() / len(df[target_column]) * 100)

    return df

The function can be used like this:

import pandas as pd
import random
from sklearn.datasets import make_classification

x,y = make_classification(100000,500)
df = pd.DataFrame(x)
df["target"] = y
df["id"] = 1 
df["id"] = df["id"].cumsum()
output_df = sample_df(df,target_column = "target",positive_percentage = 65,index_col="id")

This works fine in pandas for small datasets, but when I try it on a dataset that does not fit in memory, the computer runs out of memory and crashes.

How can I apply this function to each chunk of data read by dask, and then combine all the results?

1 answer:

Answer 0 (score: 1)

This approach runs in pure pandas and does not require dask, regardless of the size of the dataset being subsampled. You can read the data in chunks, apply the filter to each chunk, and then append each filtered chunk to an initially empty data frame; you operate on each chunk just as you would on the full df. I start from a file, since you said you cannot load the data into memory, so I changed the `df` argument of your function to `infile` and added a `chunk_size` parameter with a default of 10000, so the data is processed 10000 rows at a time:

def sample_df(infile, target_column="target", positive_percentage=35, index_col="index", chunk_size=10000):
    """
    Takes a CSV file with imbalanced records, e.g. x% positive cases, and
    returns a data frame with the specified positive percentage, e.g. 10%.
    This is accomplished by downsampling the majority class, chunk by chunk.
    """
    sampled_chunks = []
    for chunk in pd.read_csv(infile, chunksize=chunk_size):
        positive_cases = chunk[chunk[target_column] == 1][index_col]
        number_of_samples = int(((100 / positive_percentage) - 1) * len(positive_cases))
        negative_cases = list(set(chunk[index_col]) - set(positive_cases))

        try:
            negative_sample = random.sample(negative_cases, number_of_samples)
        except ValueError:
            print("The requested percentage is not valid for this dataset")
            return pd.DataFrame()

        final_sample = list(negative_sample) + list(positive_cases)
        # collect each subsampled chunk
        sampled_chunks.append(chunk[chunk[index_col].isin(final_sample)])

    # DataFrame.append was removed in pandas 2.0, so concatenate the chunks instead
    df = pd.concat(sampled_chunks, ignore_index=True)

    print("New percentage is:", df[target_column].sum() / len(df[target_column]) * 100)

    return df

This subsamples each chunk of the data rather than the entire df.