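Reproducing R's summarise/reshape results in Python

Time: 2017-02-28 19:05:36

Tags: python r pandas dplyr

I would like to reproduce in Python the behavior of R's aggregate or melt functions.

In R, the data looks like this: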

library("dplyr")

data <- summarise(group_by(table, project, resourcetype), 
                  count = n_distinct(resource_id))

  project resourcetype count
   <fctr>       <fctr> <int>
1 1000001            O     7
2 1000002            O     6
3 1000003            O    18
4 1000004            C     1
5 1000004            I     1
6 1000004            O    19
7 1000005            I     2
8 1000005            O    11
9 1000006            O     4

reshape(as.data.frame(data), 
        timevar = "resourcetype", 
        idvar = "project", 
        direction = "wide", 
        sep = "_")

  project count_O count_C count_I
1 1000001       7      NA      NA
2 1000002       6      NA      NA
3 1000003      18      NA      NA
4 1000004      19       1       1
7 1000005      11      NA       2
9 1000006       4      NA      NA

Now, in Python I get:

import pandas as pd

data = table.groupby(['project', 'resourcetype'], as_index=False)\
       .agg({'resource_id': {'count': 'nunique'}})

   project resourcetype resource_id
                              count
0  1000001            O           7
1  1000002            O           6
2  1000003            O          18
3  1000004            C           1
4  1000005            I           1
5  1000006            O          19
6  1000007            I           2
7  1000008            O          11
8  1000009            O           4

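I get a MultiIndex on the columns, which I had hoped to eliminate with as_index=False. The last column is labelled resource_id / count, whereas I would like it to be just count, as in R.

I tried using the melt function in Python, but to no avail.

Edit: the original data is a table with 2000 rows and 19 columns.

Edit 2: regarding the MultiIndex problem.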

table.groupby(['project', 'resourcetype'])\
.agg({'resource_id': {'count': 'nunique'}}).reset_index()
   project resourcetype resource_id
                              count
0  1000001            O           7

table.groupby(['project', 'resourcetype'])\
.agg({'resource_id': {'count': 'nunique'}})
                     resource_id
                           count
project resourcetype                   
1000001 O                      7

What I would like to get is:

   project resourcetype count
0  1000001            O     7
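
One way to get this flat layout is to select the column before aggregating, so that no second column level is created. The following is only a sketch: the small `table` below is a made-up stand-in for the real 2000-row table, and the named-aggregation variant assumes pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `table`; the values are invented.
table = pd.DataFrame({
    "project": [1000001, 1000001, 1000001, 1000004, 1000004, 1000004],
    "resourcetype": ["O", "O", "O", "C", "O", "O"],
    "resource_id": ["a", "b", "a", "c", "d", "e"],
})

# Select the column first, count distinct values per group,
# then name the resulting column while resetting the index.
data = (
    table.groupby(["project", "resourcetype"])["resource_id"]
    .nunique()
    .reset_index(name="count")
)

# Alternative with named aggregation (assumes pandas >= 0.25).
data_named = table.groupby(["project", "resourcetype"], as_index=False).agg(
    count=("resource_id", "nunique")
)

print(data)
```

Both variants should give a single-level header with the columns project, resourcetype and count, which is the shape of the dplyr result.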

3 Answers:

Answer 0: (score: 1)

Consider pandas' pivot, then rename the columns:

from io import StringIO
import pandas as pd

# REPRODUCIBLE EXAMPLE
text ="""
project resourcetype count
1000001            O     7
1000002            O     6
1000003            O    18
1000004            C     1
1000004            I     1
1000004            O    19
1000005            I     2
1000005            O    11
1000006            O     4
"""    
# raw string so that \s is not treated as an (invalid) string escape
df = pd.read_table(StringIO(text), sep=r"\s+")

# PIVOTED DATA
pvtdf = df.pivot(index='project', columns='resourcetype', values='count')

# RENAME COLUMNS WITH RESET_INDEX
pvtdf.columns = ['count_'+str(i) for i in pvtdf.columns.values]
pvtdf = pvtdf.reset_index()

print(pvtdf)
#    project  count_C  count_I  count_O
# 0  1000001      NaN      NaN      7.0
# 1  1000002      NaN      NaN      6.0
# 2  1000003      NaN      NaN     18.0
# 3  1000004      1.0      1.0     19.0
# 4  1000005      NaN      2.0     11.0
# 5  1000006      NaN      NaN      4.0
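
The output above has float columns in alphabetical order, whereas the R result shows integer counts in the order O, C, I. If that matters, something along these lines may get closer. It is a sketch that assumes a pandas version with the nullable Int64 dtype and uses only a few rows from the question's data; the explicit column order is hard-coded for illustration.

```python
import pandas as pd

# A few rows of the long-format data from the question.
long_df = pd.DataFrame({
    "project": [1000004, 1000004, 1000004, 1000005, 1000005],
    "resourcetype": ["C", "I", "O", "I", "O"],
    "count": [1, 1, 19, 2, 11],
})

wide = (
    long_df.pivot(index="project", columns="resourcetype", values="count")
    .add_prefix("count_")        # C, I, O -> count_C, count_I, count_O
    .astype("Int64")             # nullable integers: gaps become <NA> instead of NaN
    .rename_axis(None, axis=1)   # drop the leftover "resourcetype" columns name
    .reset_index()
)

# Reorder to match the column order of the R output (hard-coded here).
wide = wide[["project", "count_O", "count_C", "count_I"]]

print(wide)
```

The cast to Int64 is what keeps the counts as integers; a plain int64 column cannot hold missing values, which is why pivot falls back to floats.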

Answer 1: (score: 0)

The obvious solution :)

import pandas
import rpy2
from rpy2 import robjects
from rpy2.robjects import pandas2ri

# NOTE: `table` must already exist as a data frame in the R session
rdf = robjects.r('''
library(dplyr)
data <- summarise(group_by(table, project, resourcetype),
                  count = n_distinct(resource_id))
# assign the reshaped result; otherwise the wide table is discarded
data <- reshape(as.data.frame(data),
                timevar = "resourcetype",
                idvar = "project",
                direction = "wide",
                sep = "_")
data[is.na(data)] <- NaN
data
''')

# rpy2 2.x API; later rpy2 releases renamed the ri2py conversion functions
pd_df = pandas2ri.ri2py_dataframe(rdf)

Answer 2: (score: 0)

We can also use tidyr's pivot_wider instead of reshape:

r$> library(tidyr)
r$> library(dplyr)
r$> data = tribble( 
      ~project, ~resourcetype, ~count, 
      1000001,  "O",            7, 
      1000002,  "O",            6, 
      1000003,  "O",           18, 
      1000004,  "C",            1, 
      1000004,  "I",            1, 
      1000004,  "O",           19, 
      1000005,  "I",            2, 
      1000005,  "O",           11, 
      1000006,  "O",            4 
    ) 
r$> pivot_wider(
        data, 
        names_from=resourcetype, 
        values_from=count,
        names_glue="count_{resourcetype}"
    )                                                                               
# A tibble: 6 x 4
  project count_O count_C count_I
    <dbl>   <dbl>   <dbl>   <dbl>
1 1000001       7      NA      NA
2 1000002       6      NA      NA
3 1000003      18      NA      NA
4 1000004      19       1       1
5 1000005      11      NA       2
6 1000006       4      NA      NA

In Python, you can use datar:

>>> from datar.all import f, tribble, pivot_wider
>>> 
>>> df = tribble(
...     f.project, f.resourcetype, f.count,
...     1000001,   "O",            7,
...     1000002,   "O",            6,
...     1000003,   "O",            18,
...     1000004,   "C",            1,
...     1000004,   "I",            1,
...     1000004,   "O",            19,
...     1000005,   "I",            2,
...     1000005,   "O",            11,
...     1000006,   "O",            4,
... )
>>> df >> pivot_wider(
...     names_from=f.resourcetype,
...     names_glue="count_{resourcetype}",
...     values_from=f.count,
... )
   project   count_C   count_I   count_O
   <int64> <float64> <float64> <float64>
0  1000001       NaN       NaN       7.0
1  1000002       NaN       NaN       6.0
2  1000003       NaN       NaN      18.0
3  1000004       1.0       1.0      19.0
4  1000005       NaN       2.0      11.0
5  1000006       NaN       NaN       4.0

Disclaimer: I am the author of the datar package.