Split a dataframe row if it exceeds 64KB

Asked: 2020-09-17 00:59:41

Tags: python pandas split pyspark-dataframes

I have a dataframe with two columns, id and name. id is an integer and name is a list.

I want to check whether the UTF-8 length of a row exceeds 64KB. If it does, I want to split that row into N rows so that each new row stays under the 64KB size limit.
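For reference, by a row's "UTF-8 length" I mean the byte length of its name list serialized as JSON (that choice of serialization is my own assumption), measured roughly like this:

import json

def row_utf8_len(names):
    # UTF-8 byte length of a name list serialized as JSON
    return len(json.dumps(names, ensure_ascii=False).encode('utf-8'))

print(row_utf8_len(['alice', 'bob']))  # 16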

Here is what I have done so far:

import json
import math

import pandas as pd

def split_data_frame_list(df, target_column):
    """
    Splits a column with lists into rows
    
    Keyword arguments:
        df -- dataframe
        target_column -- name of column that contains lists        
    """
    # create a new dataframe with each item in a separate column, dropping rows with missing values
    col_df = pd.DataFrame(df[target_column].dropna().tolist(), index=df[target_column].dropna().index)

    # create a series with columns stacked as rows         
    stacked = col_df.stack()

    # name the last index level 'idx'
    index = stacked.index.set_names("idx", level=-1)
    new_df = pd.DataFrame(stacked, index=index, columns=[target_column])
    return new_df


df = pd.read_csv(csv_file_name)

# collapse the name values for each id into one list per row
df_new = df.groupby(['id']).agg(lambda x: tuple(x)).applymap(list).reset_index()

# longest single name, in UTF-8 bytes
lStr = int(df['name'].str.encode(encoding='utf-8').str.len().max())
maxlen = 64000
# UTF-8 byte size of the whole name column serialized as JSON
mStr = json.dumps(df_new['name'].T.to_dict(), ensure_ascii=False, sort_keys=True).encode('utf-8')
if lStr > maxlen:
    n = int(math.ceil(float(lStr) / maxlen))
    eId = df_new['id'].to_string(index=False)
    print("Splitting row with id=%s of len=%d into %d pieces of up to %d" % (eId, lStr, n, maxlen))
    split_df = split_data_frame_list(df_new, 'name')


split_data_frame_list creates one row for each element of my name column.
I'm stuck on how to change the function so that it only splits in a way that each new/split row does not exceed the 64KB limit.
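
Something like the greedy chunking below is what I am aiming for. This is only a rough, untested sketch: split_row_by_size and split_oversized_rows are names I made up, and it assumes a row's size is the JSON/UTF-8 byte length of its name list, as above.

import json
import pandas as pd

def split_row_by_size(names, max_bytes=64000):
    # Greedily pack names into chunks whose JSON/UTF-8 serialization
    # stays below max_bytes. The per-item estimate (quoted string plus
    # a ', ' separator) slightly overestimates, which errs on the safe side.
    chunks, current, current_size = [], [], 2  # 2 bytes for '[' and ']'
    for name in names:
        item_size = len(json.dumps(name, ensure_ascii=False).encode('utf-8')) + 2
        if current and current_size + item_size > max_bytes:
            chunks.append(current)
            current, current_size = [], 2
        current.append(name)
        current_size += item_size
    if current:
        chunks.append(current)
    return chunks

def split_oversized_rows(df, target_column, max_bytes=64000):
    # replace each row with N rows, one per chunk of its list column
    rows = []
    for _, row in df.iterrows():
        for chunk in split_row_by_size(row[target_column], max_bytes):
            new_row = row.to_dict()
            new_row[target_column] = chunk
            rows.append(new_row)
    return pd.DataFrame(rows)

# split_df = split_oversized_rows(df_new, 'name', max_bytes=64000)

One case I am unsure about: a single name that is itself larger than max_bytes would end up in a chunk of its own that still exceeds the limit.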

Any input would be of great help.

Thank you


0 Answers