I would like to reproduce in Python the behavior of R's aggregate or melt functions.
In R, the data looks like this:
library("dplyr")
data <- summarise(group_by(table, project, resourcetype),
count = n_distinct(resource_id))
project resourcetype count
<fctr> <fctr> <int>
1 1000001 O 7
2 1000002 O 6
3 1000003 O 18
4 1000004 C 1
5 1000004 I 1
6 1000004 O 19
7 1000005 I 2
8 1000005 O 11
9 1000006 O 4
reshape(as.data.frame(data),
timevar = "resourcetype",
idvar = "project",
direction = "wide",
sep = "_")
project count_O count_C count_I
1 1000001 7 NA NA
2 1000002 6 NA NA
3 1000003 18 NA NA
4 1000004 19 1 1
7 1000005 11 NA 2
9 1000006 4 NA NA
Now, in Python I get:
import pandas as pd
data = table.groupby(['project', 'resourcetype'], as_index=False)\
.agg({'resource_id': {'count': 'nunique'}})
project resourcetype resource_id
count
0 1000001 O 7
1 1000002 O 6
2 1000003 O 18
3 1000004 C 1
4 1000005 I 1
5 1000006 O 19
6 1000007 I 2
7 1000008 O 11
8 1000009 O 4
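I end up with a MultiIndex, which I expected as_index=False to eliminate. The last column carries two header levels, resource_id and count, whereas I want only count, as in R.
I tried to use the melt function in Python, but to no avail.
Edit: the original data is a table with 2000 rows and 19 columns.
Edit 2: regarding the MultiIndex problem.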
table.groupby(['project', 'resourcetype'])\
.agg({'resource_id': {'count': 'nunique'}}).reset_index()
project resourcetype resource_id
count
0 1000001 O 7
table.groupby(['project', 'resourcetype'])\
.agg({'resource_id': {'count': 'nunique'}})
resource_id
count
project resourcetype
1000001 O 7
What I would like to get is:
project resourcetype count
0 1000001 O 7
Answer 0 (score: 1)
Consider pandas' pivot, then update the column names:
from io import StringIO
import pandas as pd
# REPRODUCIBLE EXAMPLE
text ="""
project resourcetype count
1000001 O 7
1000002 O 6
1000003 O 18
1000004 C 1
1000004 I 1
1000004 O 19
1000005 I 2
1000005 O 11
1000006 O 4
"""
df = pd.read_table(StringIO(text), sep=r"\s+")  # raw string avoids an invalid-escape warning
# PIVOTED DATA
pvtdf = df.pivot(index='project', columns='resourcetype', values='count')
# RENAME COLUMNS WITH RESET_INDEX
pvtdf.columns = ['count_'+str(i) for i in pvtdf.columns.values]
pvtdf = pvtdf.reset_index()
print(pvtdf)
# project count_C count_I count_O
# 0 1000001 NaN NaN 7.0
# 1 1000002 NaN NaN 6.0
# 2 1000003 NaN NaN 18.0
# 3 1000004 1.0 1.0 19.0
# 4 1000005 NaN 2.0 11.0
# 5 1000006 NaN NaN 4.0
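The example above starts from counts that are already aggregated. As a minimal sketch of how the flat `project / resourcetype / count` table from the question could be produced directly, the snippet below uses named aggregation, assuming pandas 0.25 or later. The small `table` frame is a hypothetical stand-in for the asker's data, which is not shown in the thread. Note that the nested-dict form `{'resource_id': {'count': 'nunique'}}` was deprecated and, as far as I know, raises an error in recent pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `table` (the real one has 2000 rows, 19 columns).
table = pd.DataFrame({
    "project": [1000001, 1000001, 1000004, 1000004, 1000004, 1000004],
    "resourcetype": ["O", "O", "C", "I", "O", "O"],
    "resource_id": ["a", "b", "c", "d", "e", "e"],
})

# Named aggregation gives a single-level "count" column, so no MultiIndex appears.
counts = (
    table.groupby(["project", "resourcetype"], as_index=False)
         .agg(count=("resource_id", "nunique"))
)
print(counts)

# Same pivot as above, with the prefix added via add_prefix.
wide = (
    counts.pivot(index="project", columns="resourcetype", values="count")
          .add_prefix("count_")
          .reset_index()
)
wide.columns.name = None  # drop the leftover "resourcetype" axis label
print(wide)
```

An equivalent spelling for the first step is `table.groupby([...])['resource_id'].nunique().reset_index(name='count')`; which one reads better is a matter of taste.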
Answer 1 (score: 0)
The obvious solution :)
import pandas
import rpy2
from rpy2 import robjects
from rpy2.robjects import pandas2ri
rdf = robjects.r('''
library(dplyr)
# `table` is assumed to already exist as a data frame in the R session
data <- summarise(group_by(table, project, resourcetype),
count = n_distinct(resource_id))
# assign the reshaped result so that the wide table is what gets returned
data <- reshape(as.data.frame(data),
timevar = "resourcetype",
idvar = "project",
direction = "wide",
sep = "_")
data[is.na(data)] <- NaN
data
''')
# ri2py_dataframe is the rpy2 2.x API; rpy2 3.x converts inside a localconverter context instead
pd_df = pandas2ri.ri2py_dataframe(rdf)
Answer 2 (score: 0)
We can also use tidyr's pivot_wider instead of reshape:
r$> library(tidyr)
r$> library(dplyr)
r$> data = tribble(
~project, ~resourcetype, ~count,
1000001, "O", 7,
1000002, "O", 6,
1000003, "O", 18,
1000004, "C", 1,
1000004, "I", 1,
1000004, "O", 19,
1000005, "I", 2,
1000005, "O", 11,
1000006, "O", 4
)
r$> pivot_wider(
data,
names_from=resourcetype,
values_from=count,
names_glue="count_{resourcetype}"
)
# A tibble: 6 x 4
project count_O count_C count_I
<dbl> <dbl> <dbl> <dbl>
1 1000001 7 NA NA
2 1000002 6 NA NA
3 1000003 18 NA NA
4 1000004 19 1 1
5 1000005 11 NA 2
6 1000006 4 NA NA
In Python, you can use datar:
>>> from datar.all import f, tribble, pivot_wider
>>>
>>> df = tribble(
... f.project, f.resourcetype, f.count,
... 1000001, "O", 7,
... 1000002, "O", 6,
... 1000003, "O", 18,
... 1000004, "C", 1,
... 1000004, "I", 1,
... 1000004, "O", 19,
... 1000005, "I", 2,
... 1000005, "O", 11,
... 1000006, "O", 4,
... )
>>> df >> pivot_wider(
... names_from=f.resourcetype,
... names_glue="count_{resourcetype}",
... values_from=f.count,
... )
project count_C count_I count_O
<int64> <float64> <float64> <float64>
0 1000001 NaN NaN 7.0
1 1000002 NaN NaN 6.0
2 1000003 NaN NaN 18.0
3 1000004 1.0 1.0 19.0
4 1000005 NaN 2.0 11.0
5 1000006 NaN NaN 4.0
Disclaimer: I am the author of the datar package.
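For comparison, here is a rough plain-pandas sketch that also reproduces two details of the R output: the columns come out as O, C, I (order of first appearance, as the R results in this thread show) rather than alphabetically, and the counts stay integers instead of becoming floats. It assumes pandas 1.0 or later for the nullable `Int64` dtype; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "project": [1000001, 1000002, 1000003, 1000004, 1000004,
                1000004, 1000005, 1000005, 1000006],
    "resourcetype": ["O", "O", "O", "C", "I", "O", "I", "O", "O"],
    "count": [7, 6, 18, 1, 1, 19, 2, 11, 4],
})

# Resource types in order of first appearance: O, C, I.
order = df["resourcetype"].unique()

wide = (
    df.pivot(index="project", columns="resourcetype", values="count")
      .reindex(columns=order)   # pandas sorts columns alphabetically by default
      .add_prefix("count_")
      .astype("Int64")          # nullable integers: missing cells become <NA>
      .reset_index()
)
wide.columns.name = None  # drop the leftover "resourcetype" axis label
print(wide)
```

The `astype("Int64")` step is optional; without it the missing cells force the count columns to float64, as in the outputs shown earlier.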