Question

I am trying to aggregate some statistics from a groupby object on chunks of data. I have to chunk the data because there are many (18 million) rows. I want to find the number of rows in each group in each chunk, then sum them together. I can add groupby objects but when a group is not present in one term, a NaN is the result. See this case:

>>> df = pd.DataFrame({'X': ['A','B','C','A','B','C','B','C','D','B','C','D'],
                       'Y': range(12)})
>>> df
    X   Y
0   A   0
1   B   1
2   C   2
3   A   3
4   B   4
5   C   5
6   B   6
7   C   7
8   D   8
9   B   9
10  C  10
11  D  11
>>> df[0:6].groupby(['X']).count() + df[6:].groupby(['X']).count()
    Y
X    
A NaN
B   4
C   4
D NaN

But I want to see:

    Y
X    
A   2
B   4
C   4
D   2

Is there a good way to do this? Note in the real code I am looping through a chunked iterator of a million rows per groupby.

Answer 1

Call add and pass fill_value=0, so that a group missing from one of the terms is treated as 0 instead of producing NaN. You could iteratively add whilst chunking, I guess.
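A minimal sketch of that idea, using the DataFrame from the question. The helper name chunked_counts and the chunk size of 6 are illustrative assumptions, not part of the original answer; in the real code the chunks would come from whatever chunked iterator is already in use.

```python
import pandas as pd

df = pd.DataFrame({'X': ['A','B','C','A','B','C','B','C','D','B','C','D'],
                   'Y': range(12)})

def chunked_counts(chunks, key):
    """Sum per-group row counts over an iterable of DataFrame chunks.

    Hypothetical helper: groups missing from one side of the addition
    are treated as 0 via fill_value instead of turning into NaN.
    """
    total = None
    for chunk in chunks:
        counts = chunk.groupby(key).count()
        total = counts if total is None else total.add(counts, fill_value=0)
    return total

# Stand-in for the real chunked iterator: two slices of 6 rows each.
chunks = (df[i:i + 6] for i in range(0, len(df), 6))

# add() with fill_value returns floats, so cast back to integers at the end.
result = chunked_counts(chunks, 'X').astype(int)
print(result)  # counts per group: A 2, B 4, C 4, D 2
```

Only the running total and the current chunk's counts are held at any time, so memory use depends on the number of distinct groups rather than on the total number of rows.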

Aggregation of pandas groupby objects
