I am fairly new to Python and am trying to use it as a backup platform for my data analysis work. I normally use data.table for my data analysis needs.
The problem is that when I run a group-aggregate operation on a large CSV file (randomized, compressed, and uploaded to http://www.filedropper.com/ddataredact_1), Python throws:
return getattr(obj, method)(*args, **kwds)
ValueError: negative dimensions are not allowed
or (I have even run into...)
File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in cartesian_product
    for i, x in enumerate(X)]
File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\util.py", line 65, in <listcomp>
    for i, x in enumerate(X)]
File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 445, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
File "C:\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py", line 51, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
MemoryError
I spent three days trying to shrink the file size (I managed to cut it by 89%), adding breakpoints, and debugging, but I made no progress.
Surprisingly, the same group/aggregate operation in R's data.table takes barely a second. What's more, I didn't have to do any of the data-type conversions suggested in https://www.dataquest.io/blog/pandas-big-data/.
I also looked into other threads: Avoiding Memory Issues For GroupBy on Large Pandas DataFrame, Pandas: df.groupby() is too slow for big data set. Any alternatives methods?, and pandas groupby with sum() on large csv file?. They seem to be more about matrix multiplication, so I would appreciate it if you didn't mark this as a duplicate.
Here is my Python code:
import pandas as pd

finaldatapath = "..\Data_R"
ddata = pd.read_csv(finaldatapath + "\\" + "ddata_redact.csv", low_memory=False, encoding="ISO-8859-1")
#before optimization: 353MB
ddata.info(memory_usage="deep")
#optimize file: Object-types are the biggest culprit.
ddata_obj = ddata.select_dtypes(include=['object']).copy()
#Now convert this to category type:
#Float type didn't help much, so I am excluding it here.
for col in ddata_obj:
    del ddata[col]
    ddata.loc[:, col] = ddata_obj[col].astype('category')
#release memory
del ddata_obj
#after optimization: 39MB
ddata.info(memory_usage="deep")
#Create a list of grouping variables:
group_column_list = [
"Business",
"Device_Family",
"Geo",
"Segment",
"Cust_Name",
"GID",
"Device ID",
"Seller",
"C9Phone_Margins_Flag",
"C9Phone_Cust_Y_N",
"ANDroid_Lic_Type",
"Type",
"Term",
'Cust_ANDroid_Margin_Bucket',
'Cust_Mobile_Margin_Bucket',
# # 'Cust_Android_App_Bucket',
'ANDroind_App_Cust_Y_N'
]
print("Analyzing data now...")
def ddata_agg(x):
    names = {
        'ANDroid_Margin': x['ANDroid_Margin'].sum(),
        'Margins': x['Margins'].sum(),
        'ANDroid_App_Qty': x['ANDroid_App_Qty'].sum(),
        'Apple_Margin': x['Apple_Margin'].sum(),
        'P_Lic': x['P_Lic'].sum(),
        'Cust_ANDroid_Margins': x['Cust_ANDroid_Margins'].mean(),
        'Cust_Mobile_Margins': x['Cust_Mobile_Margins'].mean(),
        'Cust_ANDroid_App_Qty': x['Cust_ANDroid_App_Qty'].mean()
    }
    return pd.Series(names)
ddata=ddata.reset_index(drop=True)
ddata = ddata.groupby(group_column_list).apply(ddata_agg)
The .groupby operation above is what crashes the code.
Can somebody please help me? I have spent more time on this StackOverflow post than on any other, trying to fix it and learning new things about Python along the way. But I have hit a saturation point, and it is all the more frustrating because R's data.table package handles this file in under 2 seconds. This post is not about the pros and cons of R versus Python, but about being more productive with Python.
I am completely lost, and I would appreciate any help.
Here is my R data.table code:
path_r = "../ddata_redact.csv"
ddata <- data.table::fread(path_r, stringsAsFactors = FALSE, data.table = TRUE, header = TRUE)
group_column_list <-c(
"Business",
"Device_Family",
"Geo",
"Segment",
"Cust_Name",
"GID",
"Device ID",
"Seller",
"C9Phone_Margins_Flag",
"C9Phone_Cust_Y_N",
"ANDroid_Lic_Type",
"Type",
"Term",
'Cust_ANDroid_Margin_Bucket',
'Cust_Mobile_Margin_Bucket',
# # 'Cust_Android_App_Bucket',
'ANDroind_App_Cust_Y_N'
)
ddata<-ddata[, .(ANDroid_Margin = sum(ANDroid_Margin,na.rm = TRUE),
Margins=sum(Margins,na.rm = TRUE),
Apple_Margin=sum(Apple_Margin,na.rm=TRUE),
Cust_ANDroid_Margins = mean(Cust_ANDroid_Margins,na.rm = TRUE),
Cust_Mobile_Margins = mean(Cust_Mobile_Margins,na.rm = TRUE),
Cust_ANDroid_App_Qty = mean(Cust_ANDroid_App_Qty,na.rm = TRUE),
ANDroid_App_Qty=sum(ANDroid_App_Qty,na.rm = TRUE)
),
by=group_column_list]
In addition to Josemz's comment, here are two threads on agg vs. apply: What is the difference between pandas agg and apply function? and Pandas difference between apply() and aggregate() functions.
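To make the agg vs. apply distinction from those threads concrete, here is a minimal sketch on a toy frame (the frame and its columns g and v are made up for illustration, not taken from my data): apply hands the whole sub-DataFrame for each group to the function, while agg maps each column to a reducing function.
import pandas as pd
toy = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1, 2, 3]})
# apply: the callable receives each group as a full DataFrame
by_apply = toy.groupby('g').apply(lambda x: x['v'].sum())
# agg: each column is mapped to an aggregation function
by_agg = toy.groupby('g').agg({'v': 'sum'})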
Answer (score: 1)
I think what you're looking for is agg rather than apply. You can pass agg a dict mapping columns to the functions you want applied, so I think this will work for you:
ddata = ddata.groupby(group_column_list).agg({
'ANDroid_Margin' : sum,
'Margins' : sum,
'ANDroid_App_Qty' : sum,
'Apple_Margin' : sum,
'P_Lic' : sum,
'Cust_ANDroid_Margins': 'mean',
'Cust_Mobile_Margins' : 'mean',
'Cust_ANDroid_App_Qty': 'mean'})
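Two side notes that may help, hedged because they go beyond the snippet above: agg keeps the group keys in a MultiIndex, so a trailing reset_index() gives a flat table comparable to the data.table output; and since the grouping columns were converted to category dtype, pandas by default builds every combination of the category levels when grouping (which is what the cartesian_product call in the traceback points at), so if your pandas version supports it (0.23+), passing observed=True to groupby may avoid that blow-up. A rough sketch along those lines:
# observed=True (pandas 0.23+) limits the result to category combinations that actually occur
result = ddata.groupby(group_column_list, observed=True).agg({
    'ANDroid_Margin': 'sum',
    'Margins': 'sum',
    'ANDroid_App_Qty': 'sum',
    'Apple_Margin': 'sum',
    'P_Lic': 'sum',
    'Cust_ANDroid_Margins': 'mean',
    'Cust_Mobile_Margins': 'mean',
    'Cust_ANDroid_App_Qty': 'mean'
}).reset_index()  # reset_index() turns the group keys back into ordinary columns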