Convert a pandas Series to a well-formed DataFrame

Time: 2017-05-17 21:09:57

Tags: python pandas dataframe series

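I have a groupby result g (a pandas Series with a two-level index; its definition and first rows appear in the transcript further down).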

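My goal is to rank the top 100 IDs by ratio (the rightmost column) in descending order, where isconfirm = 0.
To do that, I thought about getting a clean DataFrame with well-named columns, so that I could query the top IDs by ratio where isconfirm = 0.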

I tried, for example:

g = dfchurn.groupby('ID')['isconfirm'].value_counts().groupby(level=0).apply(lambda x: x / float(x.sum())) 
type(g) 
Out[230]: pandas.core.series.Series
g.head(5)
Out[226]: 
ID         isconfirm
0000       0            0.985981
           1            0.014019
0064       0            0.996448
           1            0.003552
0080       0            0.997137   

That doesn't lead anywhere. There must be a clean, concise way to do this.

2 answers:

Answer 0 (score: 1)

You can use g.loc to select all rows where isconfirm is 0:

In [90]: g.loc[:, 0]
Out[90]: 
ID
0    0.827957
1    0.911111
2    0.944954
3    0.884956
4    0.931373
5    0.869048
6    0.941176
7    0.884615
8    0.901961
9    0.930693
Name: isconfirm, dtype: float64

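The 0 in [:, 0] refers to a value in the second level of the index (isconfirm); the result is indexed by ID alone. So you can find the IDs corresponding to the top 100 values with: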

In [93]: g.loc[:, 0].sort_values(ascending=False).head(100)
Out[93]: 
ID
2    0.944954
6    0.941176
4    0.931373
9    0.930693
1    0.911111
8    0.901961
3    0.884956
7    0.884615
5    0.869048
0    0.827957
Name: isconfirm, dtype: float64

In [94]: g.loc[:, 0].sort_values(ascending=False).head(100).index
Out[94]: Int64Index([2, 6, 4, 9, 1, 8, 3, 7, 5, 0], dtype='int64', name='ID')
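
As a small aside, the same selection can be written by level name with Series.xs, which some readers find easier to follow than the positional [:, 0]. The sketch below is only illustrative: it rebuilds a tiny Series by hand from the five rows printed in the question's g.head(5), so the variable names by_loc, by_xs and top_ids are assumptions, not part of the original post.

```python
import pandas as pd

# Hand-built stand-in for g, using the rows shown in the question's g.head(5).
idx = pd.MultiIndex.from_tuples(
    [("0000", 0), ("0000", 1), ("0064", 0), ("0064", 1), ("0080", 0)],
    names=["ID", "isconfirm"],
)
g = pd.Series(
    [0.985981, 0.014019, 0.996448, 0.003552, 0.997137],
    index=idx,
    name="isconfirm",
)

# Positional form from the answer: second index level equals 0.
by_loc = g.loc[:, 0]

# Same rows, selected by level name instead of position.
by_xs = g.xs(0, level="isconfirm")

# IDs ordered by ratio, highest first (head(100) keeps at most 100 of them).
top_ids = by_xs.sort_values(ascending=False).head(100).index.tolist()
print(top_ids)
```

Both forms drop the isconfirm level, so the result is indexed by ID only.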

To produce the results above, I defined g this way:

import numpy as np
import pandas as pd
np.random.seed(2017)

N = 1000
dfchurn = pd.DataFrame({'ID':np.random.randint(10, size=N),
                        'isconfirm': np.random.choice(2, p=[0.9, 0.1], size=N)})
g = dfchurn.groupby('ID')['isconfirm'].value_counts().groupby(level=0).apply(lambda x: x / float(x.sum())) 
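
The per-ID normalization in g can probably also be obtained more directly with value_counts(normalize=True), which avoids the second groupby and the lambda. This is a sketch under the same synthetic-data assumptions as above; ratios and totals are illustrative names, and the Series name it produces differs between pandas versions, so the code does not rely on it.

```python
import numpy as np
import pandas as pd

np.random.seed(2017)

# Same synthetic data as in the answer above.
N = 1000
dfchurn = pd.DataFrame({
    "ID": np.random.randint(10, size=N),
    "isconfirm": np.random.choice(2, p=[0.9, 0.1], size=N),
})

# Share of each isconfirm value within each ID, in one step.
ratios = dfchurn.groupby("ID")["isconfirm"].value_counts(normalize=True)

# Sanity check: the shares within every ID add up to 1.
totals = ratios.groupby(level="ID").sum()
print(totals.round(6).unique())
```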

Answer 1 (score: 0)

I found a hint in a related question:

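gdf = g.to_frame()     # assumed starting point: g is the Series from the question
gdf.unstack(level=1)   # wide view only; the result is not assigned, so gdf is unchanged
gdf = gdf.add_suffix('_ratio').reset_index()  # KEY STEP

gdf.columns   # friendly columns now
Index([u'ID', u'isconfirm', u'isconfirm_ratio'], dtype='object')

gdf[gdf['isconfirm_ratio'] > 0.999]   # e.g. a filter like this works now, or a sort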
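
Putting the pieces together, one possible end-to-end version of the question's goal (top IDs by ratio where isconfirm = 0, from a flat DataFrame with named columns) might look like the sketch below. It uses a hand-built Series with the values from the question's g.head(5); renaming the Series before reset_index is an assumption made here to avoid a clash between the value column and the isconfirm index level, and top is an illustrative name.

```python
import pandas as pd

# Hand-built stand-in for g, using the rows shown in the question's g.head(5).
idx = pd.MultiIndex.from_tuples(
    [("0000", 0), ("0000", 1), ("0064", 0), ("0064", 1), ("0080", 0)],
    names=["ID", "isconfirm"],
)
g = pd.Series(
    [0.985981, 0.014019, 0.996448, 0.003552, 0.997137],
    index=idx,
    name="isconfirm",
)

# Name the values, then move both index levels into ordinary columns.
gdf = g.rename("isconfirm_ratio").reset_index()

# Keep isconfirm == 0, sort by ratio descending, take at most 100 rows.
top = (
    gdf[gdf["isconfirm"] == 0]
    .sort_values("isconfirm_ratio", ascending=False)
    .head(100)
)
print(top)
```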