I have the following data stored in a CSV (df_sample.csv), and the column names in a list called cols_list.
df_data_sample:
import numpy as np
import pandas as pd

df_data_sample = pd.DataFrame({
    'new_video': ['BASE','SHIVER','PREFER','BASE+','BASE+','EVAL','EVAL','PREFER','ECON','EVAL'],
    'ord_m1':   [0,1,1,0,0,0,1,0,1,0],
    'rev_m1':   [0,0,25.26,0,0,9.91,'NA',0,0,0],
    'equip_m1': [0,0,0,'NA',24.9,20,76.71,57.21,0,12.86],
    'oev_m1':   [3.75,8.81,9.95,9.8,0,0,'NA',10,56.79,30],
    'irev_m1':  ['NA',19.95,0,0,4.95,0,0,29.95,'NA',13.95]
})
attribute_dict = {
    'new_video': 'CAT',
    'ord_m1': 'NUM',
    'rev_m1': 'NUM',
    'equip_m1': 'NUM',
    'oev_m1': 'NUM',
    'irev_m1': 'NUM'
}
I then read each column one at a time and process the data as follows:
cols_list = df_data_sample.columns

# Write to csv.
df_data_sample.to_csv("df_seg_sample.csv", index=False)
#df_data_sample = pd.read_csv("df_seg_sample.csv")

# Create an empty dataframe to hold the final processed data.
df_final = pd.DataFrame()

# Read in each column, process it, and append it to df_final.
for column in cols_list:
    df_column = pd.read_csv('df_seg_sample.csv', usecols=[column], delimiter=',')
    if attribute_dict[column] == 'CAT' and df_column[column].unique().size <= 100:
        df_target_attribute = pd.get_dummies(df_column[column], dummy_na=True, prefix=column)
        # Check for and remove duplicate columns, if any:
        df_target_attribute = df_target_attribute.loc[:, ~df_target_attribute.columns.duplicated()]
        for target_column in list(df_target_attribute.columns):
            # If the variance of a dummy is zero, skip it (append it to a
            # list and print it to a log file); otherwise keep it.
            if np.var(df_target_attribute[target_column]) != 0:
                df_final[target_column] = df_target_attribute[target_column]
    elif attribute_dict[column] == 'NUM':
        # Impute with 0 for numeric variables:
        df_target_attribute = df_column
        df_target_attribute.fillna(value=0, inplace=True)
        df_final[column] = df_target_attribute[column]
attribute_dict is a dictionary that maps each variable name to its variable type ('CAT' or 'NUM'), as defined above.
However, this column-by-column operation takes a very long time on a dataset of size **(5 million rows * 3400 columns)**; the current runtime is over 12 hours. I want to reduce this as much as possible. One approach I thought of is to process all NUM columns at once, and then go column by column for the CAT variables. However, I am not sure how to write that in Python, nor whether it would actually speed things up. Can someone help me?
Answer 0 (score: 1)
There are three things I would suggest to speed up your computation:
df = df.loc[:, df.nunique() > 100]
# drop every column with 100 or fewer unique values
This answer by the author of pandas on large-data workflows may also be useful to you.
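For context, the core idea of that workflow is to stream the file in row chunks instead of re-reading it once per column. A minimal sketch (my illustration, not code from the linked answer; chunksize=100000 is an arbitrary assumption):

import pandas as pd

# Stream the CSV in row chunks; each chunk is an ordinary DataFrame.
# chunksize=100000 is an arbitrary example value - tune it to your memory.
chunks = []
for chunk in pd.read_csv('df_seg_sample.csv', chunksize=100000):
    chunks.append(chunk.fillna(0))  # placeholder per-chunk processing
df_processed = pd.concat(chunks, ignore_index=True)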
Answer 1 (score: 1)
For the numeric columns this is simple:
num_cols = [k for k, v in attribute_dict.items() if v == 'NUM']
print(num_cols)
['ord_m1', 'rev_m1', 'equip_m1', 'oev_m1', 'irev_m1']

df1 = pd.read_csv('df_seg_sample.csv', usecols=num_cols).fillna(0)
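If you also want the zero-variance filter from the question applied to this numeric block, it can be done in one vectorized step instead of column by column. A sketch, assuming df1 and df_final from the snippets above:

import pandas as pd

# Keep only numeric columns whose variance is non-zero, all at once.
df1 = df1.loc[:, df1.var() != 0]
# Append the whole numeric block to the final frame in a single concat.
df_final = pd.concat([df_final, df1], axis=1)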
But the first part of the code is the performance problem, especially the get_dummies call on 5 million rows:

df_target_attribute = pd.get_dummies(df_column[column], dummy_na=True, prefix=column)
Unfortunately, get_dummies is problematic to run in chunks, because different chunks can contain different category levels and therefore produce different dummy columns.
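A common workaround (my sketch, not part of this answer's original code): fix the category levels up front with a CategoricalDtype, so that get_dummies emits identical columns for every chunk and the results can be concatenated safely. Shown for the single CAT column new_video; chunksize=100000 is an arbitrary assumption:

import pandas as pd
from pandas.api.types import CategoricalDtype

# First pass: collect every category level of the CAT column.
levels = pd.read_csv('df_seg_sample.csv', usecols=['new_video'])['new_video'].dropna().unique()
cat_dtype = CategoricalDtype(categories=sorted(levels))

# Second pass: with a fixed categorical dtype, every chunk yields the
# same dummy columns, so the chunks line up when concatenated.
dummy_chunks = []
for chunk in pd.read_csv('df_seg_sample.csv', usecols=['new_video'], chunksize=100000):
    coded = chunk['new_video'].astype(cat_dtype)
    dummy_chunks.append(pd.get_dummies(coded, dummy_na=True, prefix='new_video'))
df_dummies = pd.concat(dummy_chunks, ignore_index=True)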