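I am grouping a pandas DataFrame by item-date pairs and want to use lambdas to add some custom conditional functions to a larger aggregation.
Using the tip here, I can do the following, which works and counts the positive and non-positive values in a given column.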
item_day_count=item_day_group['PriceDiff_pct'].agg({'Pos':lambda val: (val > 0).sum(),'Neg':lambda val: (val <= 0).sum()}).reset_index()
I can also run a different aggregation that mixes built-in aggregations with a custom percentile function and returns the correct statistics:
item_day_count_v2=item_day_group['PriceDiff_pct'].agg(['count','min',percentile(25),'mean','median',percentile(75),'max']).reset_index()
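The question does not show how percentile is defined. A minimal sketch of one way such a factory could look, assuming it wraps np.percentile and sets a unique __name__ so that agg() can label the output columns (the sample data and the percentile_ helper name are hypothetical):

```python
import numpy as np
import pandas as pd


def percentile(n):
    """Return an aggregation function that computes the n-th percentile."""
    def percentile_(x):
        return np.percentile(x, n)
    # agg() labels the output column with __name__, so make it unique per n.
    percentile_.__name__ = f"percentile_{n}"
    return percentile_


# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "item": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "PriceDiff_pct": [1.0, 2.0, 3.0, 4.0, -2.0, 0.0, 2.0, 4.0],
})

stats = df.groupby("item")["PriceDiff_pct"].agg(
    ["count", "min", percentile(25), "median", percentile(75), "max"]
)
print(stats)
```

Without the distinct __name__ values, both percentile functions would share one name and pandas would reject the list because function names must be unique.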
However, I can't figure out how to combine these into one larger call. When I try the following, I get the error AttributeError: 'DataFrameGroupBy' object has no attribute 'name':
item_day_count_v3=item_day_group['PriceDiff_pct'].agg(['count',{'Pos_Return':lambda val: (val > 0).sum(),'Neg_Return':lambda val: (val <= 0).sum()},'min',percentile(25),'mean','median',percentile(75),'max']).reset_index()
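Does anyone know how to combine these? It seems like I'm close, given that the two separate pieces work. Thanks for your help!
Answer 0 (score: 0)
I would recommend against mixing dict-defined functions with native aggregators in one list. Instead, you can pass everything as a list of tuples, each holding an output name and a function.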
The name in each tuple becomes the column name.
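A minimal sketch of the tuple-list approach described in this answer. The sample data and the "day" column are hypothetical; the names item_day_group, PriceDiff_pct, Pos_Return and Neg_Return follow the question. Custom functions such as the question's percentile(25) could be added to the same list as further (name, function) pairs.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "item": ["a", "a", "a", "b", "b", "b"],
    "day": ["d1", "d1", "d1", "d1", "d1", "d1"],
    "PriceDiff_pct": [0.5, -0.2, 0.1, 0.0, -0.3, 0.4],
})
item_day_group = df.groupby(["item", "day"])

# Each tuple is (output column name, aggregation function or its string name).
aggs = [
    ("count", "count"),
    ("Pos_Return", lambda val: (val > 0).sum()),
    ("Neg_Return", lambda val: (val <= 0).sum()),
    ("min", "min"),
    ("mean", "mean"),
    ("median", "median"),
    ("max", "max"),
]

item_day_count_v3 = item_day_group["PriceDiff_pct"].agg(aggs).reset_index()
print(item_day_count_v3)
```

Because every entry carries an explicit name, the two lambdas no longer collide on the shared name "<lambda>".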
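Answer 1 (score: 0)
From the pandas docs for the aggregate() method:
Accepted combinations are:
string function name
function
list of functions
dict of column names -> functions (or list of functions)
I would say that it does not support every mix of these.
So, you can try this:
First collect everything in a dict, then aggregate using that dict.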
# The functions to agg on every column.
agg_dict = dict((c, ['count','min',percentile(25),'mean','median',percentile(75),'max']) for c in item_day.columns.values)
# Append to the dict the column-specific functions.
agg_dict['Pos_Return'] = lambda val: (val > 0).sum()
agg_dict['Neg_Return'] = lambda val: (val <= 0).sum()
# Agg using the dict.
item_day_group['PriceDiff_pct'].agg(agg_dict)
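One caveat with the dict approach: the keys of a dict passed to agg() on a grouped DataFrame are interpreted as column names, so keys like 'Pos_Return' would have to be existing columns, and recent pandas versions (1.0 and later, as far as I am aware) reject a dict passed to a single grouped column. A hedged sketch of a dict-based variant that keys on the real column and lists every function under it. The sample data and the pos_return/neg_return helper names are hypothetical:

```python
import pandas as pd


def pos_return(val):
    """Count strictly positive values."""
    return (val > 0).sum()


def neg_return(val):
    """Count zero or negative values."""
    return (val <= 0).sum()


# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "item": ["a", "a", "a", "b", "b", "b"],
    "day": ["d1", "d1", "d1", "d1", "d1", "d1"],
    "PriceDiff_pct": [0.5, -0.2, 0.1, 0.0, -0.3, 0.4],
})
item_day_group = df.groupby(["item", "day"])

# Keys are existing column names; values are lists of functions for that column.
agg_dict = {
    "PriceDiff_pct": ["count", "min", "mean", "median", "max",
                      pos_return, neg_return],
}

result = item_day_group.agg(agg_dict)
print(result)
```

The result has two-level column labels, (column name, function name), so an individual statistic is selected with, for example, result[("PriceDiff_pct", "pos_return")].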
Answer 2 (score: 0)
As others have said, you cannot mix named functions with a dict in the agg() method.
Here is a practical way to do what you want. Let's build some data.
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y'] * 3,
                   'B': [10, 20, 30, 40, 50, 60]})
df
Out[38]:
A B
0 x 10
1 y 20
2 x 30
3 y 40
4 x 50
5 y 60
Define a function that counts the values greater than or equal to 30.
def ge30(x):
    return (x >= 30).sum()
Now use it in groupby().agg().
df.groupby('A').agg(['sum', 'mean', ge30])
Out[40]:
     B
   sum mean ge30
A
x   90   30    2
y  120   40    2