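I have a DataFrame with two columns, id and name. id is an integer and name is a list.
I want to check whether a row's UTF-8 encoded length is greater than 64 KB. If it is, I want to split that row into N rows so that each new row is under the 64 KB size limit.
Here is what I have done so far: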
import json
import math

import pandas as pd


def split_data_frame_list(df, target_column):
    """
    Splits a column with lists into rows

    Keyword arguments:
    df -- dataframe
    target_column -- name of column that contains lists
    """
    # create a new dataframe with each item in a separate column, dropping rows with missing values
    col_df = pd.DataFrame(df[target_column].dropna().tolist(), index=df[target_column].dropna().index)
    # create a series with columns stacked as rows
    stacked = col_df.stack()
    # rename the last index level to 'idx'
    index = stacked.index.rename(names="idx", level=-1)
    new_df = pd.DataFrame(stacked, index=index, columns=[target_column])
    return new_df


df = pd.read_csv(csv_file_name)
# collect all names that share an id into one list per row
df_new = df.groupby(['id']).agg(lambda x: tuple(x)).applymap(list).reset_index()
# longest single name, in UTF-8 bytes
lStr = int(df['name'].str.encode(encoding='utf-8').str.len().max())
maxlen = 64000
# JSON-serialised UTF-8 bytes of the grouped names (not used below)
mStr = json.dumps(df_new['name'].T.to_dict(), ensure_ascii=False, sort_keys=True).encode('utf-8')
if lStr > maxlen:
    n = int(math.ceil(float(lStr) / maxlen))
    eId = df_new['id'].to_string(index=False)
    print("Splitting row with id=%s of len=%d into %d pieces of up to %d" % (eId, lStr, n, maxlen))
    split_df = split_data_frame_list(df_new, 'name')
The split_data_frame_list function creates one row for each element in my name column.
I'm stuck on how to change the function so that it only splits a row into pieces that each stay under the 64 KB limit.
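To make the behaviour I'm after concrete, here is a minimal sketch of a greedy grouping. It is not working code from my project: the helper name chunk_by_bytes is hypothetical, and it assumes a row's size is simply the sum of the UTF-8 byte lengths of the strings in the list, ignoring separators or JSON quoting.

```python
def chunk_by_bytes(items, max_bytes=64000):
    """Greedily group strings so each group's total UTF-8 size is <= max_bytes.

    Assumption: a group's size is the sum of the UTF-8 byte lengths of its
    items; separators and JSON quoting are not counted.
    """
    chunks, current, current_size = [], [], 0
    for item in items:
        size = len(str(item).encode("utf-8"))
        if size > max_bytes:
            # a single element this large cannot fit in any chunk
            raise ValueError("single item of %d bytes exceeds the limit" % size)
        # start a new chunk when adding this item would cross the limit
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


# tiny demonstration with a 10-byte limit instead of 64 KB
names = ["alpha", "beta", "gamma", "é" * 4]  # "é" is 2 bytes in UTF-8
pieces = chunk_by_bytes(names, max_bytes=10)
print(pieces)
```

If something like this is reasonable, I imagine applying it per row with df_new['name'].apply(chunk_by_bytes) and then calling DataFrame.explode on that column, so that each chunk becomes its own row with the id repeated, but I have not verified that against my real data.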
Any input would be a great help.
Thank you.