I am trying to aggregate some statistics from a groupby object on chunks of data. I have to chunk the data because there are many (18 million) rows. I want to find the number of rows in each group in each chunk, then sum them together. I can add groupby objects but when a group is not present in one term, a NaN is the result. See this case:
>>> df = pd.DataFrame({'X': ['A','B','C','A','B','C','B','C','D','B','C','D'],
...                    'Y': range(12)})
>>> df
    X   Y
0   A   0
1   B   1
2   C   2
3   A   3
4   B   4
5   C   5
6   B   6
7   C   7
8   D   8
9   B   9
10  C  10
11  D  11
>>> df[0:6].groupby(['X']).count() + df[6:].groupby(['X']).count()
     Y
X
A  NaN
B    4
C    4
D  NaN
But I want to see:
   Y
X
A  2
B  4
C  4
D  2
Is there a good way to do this? Note that in the real code I am looping through a chunked iterator, with a million rows per groupby.
Answer 0 (score: 2)
Call add and pass fill_value=0, so that a group missing from one operand is counted as 0 instead of producing NaN.
You could iteratively add whilst chunking, I guess.
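A minimal sketch of both suggestions, using the example frame from the question. It first adds the two per-slice counts with DataFrame.add and fill_value=0, then applies the same call in a loop over chunks. The helper name count_by_group and the slicing generator that stands in for the real chunked iterator are assumptions made for illustration, not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({'X': ['A', 'B', 'C', 'A', 'B', 'C', 'B', 'C', 'D', 'B', 'C', 'D'],
                   'Y': range(12)})

# Direct form: a group missing from one side is treated as 0 rather than NaN.
first = df[0:6].groupby('X').count()
second = df[6:].groupby('X').count()
combined = first.add(second, fill_value=0)


def count_by_group(chunks, key='X'):
    """Accumulate per-group row counts over an iterable of DataFrame chunks.

    Hypothetical helper: `chunks` can be any iterable of DataFrames.
    """
    total = None
    for chunk in chunks:
        counts = chunk.groupby(key).count()
        # The first chunk seeds the total; later chunks are added with
        # fill_value=0 so groups absent from either side do not become NaN.
        total = counts if total is None else total.add(counts, fill_value=0)
    # Aligning mismatched indexes yields floats, so cast back to integers.
    return total if total is None else total.astype(int)


# Stand-in for the real chunked iterator: slices of four rows each.
chunks = (df[i:i + 4] for i in range(0, len(df), 4))
totals = count_by_group(chunks)
print(totals)
```

Only the running total and the current chunk's counts are held at once, which is the point of chunking 18 million rows. Note that add returns floats when the indexes differ, hence the final cast.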