Let's say I have a pandas DataFrame named mydf, i.e.:
import pandas as pd
mydf = pd.DataFrame({
'type':['A','A','A', 'B','B','B', 'C'],
'state':['NY','CA','NY', 'NY','CA','CA', 'WY'],
'date':['2018-01-02','2018-01-04','2018-02-06',
'2018-01-01','2018-01-24','2018-02-10','2018-01-24']
})
Out[28]:
         date state type
0  2018-01-02    NY    A
1  2018-01-04    CA    A
2  2018-02-06    NY    A
3  2018-01-01    NY    B
4  2018-01-24    CA    B
5  2018-02-10    CA    B
6  2018-01-24    WY    C
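Across all records (types A, B, and C), I want a table that counts the total number of records per state and date (year and month only, not the day), then the number of type-A records in each group, and finally the percentage of type-A records relative to the group total.
That is, the final output should be another pandas DataFrame with the following columns and values: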
date_ym  state  total_count  total_type_A  percentage
20181    CA     2            1             50
20181    NY     2            1             50
20181    WY     1            0             0
20182    CA     1            0             0
20182    NY     1            1             100
I could create two tables, merge them, and then compute the counts, but I'm looking for a simpler one-liner...
Answer 0 (score: 2)
First, convert the date to a year-month string:
# The column holds strings, so parse it before using the .dt accessor
mydf["date"] = pd.to_datetime(mydf["date"]).dt.strftime("%Y%m")
Then use groupby.agg:
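def total_type_A(x):
    # Number of type-A records in the group
    return (x == "A").sum()
def percentage(x):
    # Share of type-A records in the group, in percent
    return (x == "A").mean() * 100
# Aggregate only the "type" column so the result has flat column names
mydf.groupby(["date", "state"])["type"].agg([len, total_type_A, percentage])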
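The same idea can also be sketched with named aggregation, which gives the output columns the names requested in the question. This is only one possible variant, assuming pandas 0.25 or newer; the helper column `is_A` and the variable `summary` are names made up for this illustration. Note that `%Y%m` zero-pads the month, so the key reads `201801` rather than `20181`.

```python
import pandas as pd

mydf = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'state': ['NY', 'CA', 'NY', 'NY', 'CA', 'CA', 'WY'],
    'date': ['2018-01-02', '2018-01-04', '2018-02-06',
             '2018-01-01', '2018-01-24', '2018-02-10', '2018-01-24'],
})

# Parse the strings into datetimes so the .dt accessor is available
dates = pd.to_datetime(mydf["date"])

summary = (
    mydf.assign(
        date_ym=dates.dt.strftime("%Y%m"),  # zero-padded year-month key
        is_A=mydf["type"].eq("A"),          # hypothetical boolean helper column
    )
    .groupby(["date_ym", "state"])
    .agg(
        total_count=("is_A", "size"),    # rows per group
        total_type_A=("is_A", "sum"),    # True values count as 1
        percentage=("is_A", "mean"),     # fraction of True values
    )
    .reset_index()
)
summary["percentage"] *= 100  # convert the fraction to percent
print(summary)
```

Because the counts come from `size` and `sum` rather than from a mixed-type Series, they stay integers in the result.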
Answer 1 (score: 2)
Another approach is to write a function that returns a Series with the desired columns.
Full example:
import pandas as pd
df = pd.DataFrame({
'type':['A','A','A', 'B','B','B', 'C'],
'state':['NY','CA','NY', 'NY','CA','CA', 'WY'],
'date':['2018-01-02','2018-01-04','2018-02-06',
'2018-01-01','2018-01-24','2018-02-10','2018-01-24']
})
# %#m (month without zero padding) is Windows-only; use %-m on Linux/macOS
df['date_ym'] = pd.to_datetime(df['date']).dt.strftime('%Y%#m')
def func(x):
    cnt = len(x)
    cnt_A = sum(x == 'A')
    return pd.Series({
        'total_count': cnt,
        'total_type_A': cnt_A,
        'percentage': cnt_A/cnt*100
    })
df = df.groupby(['date_ym','state'])['type'].apply(func).unstack().reset_index()
print(df)
Returns:
  date_ym state  total_count  total_type_A  percentage
0   20181    CA          2.0           1.0        50.0
1   20181    NY          2.0           1.0        50.0
2   20181    WY          1.0           0.0         0.0
3   20182    CA          1.0           0.0         0.0
4   20182    NY          1.0           1.0       100.0
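Two details of this answer may be worth adjusting. The unpadded-month flag differs between platforms, and the counts come back as floats because the Series returned by `func` mixes integers with a float percentage. The sketch below is one possible way around both, under the assumption that building the key from `dt.year` and `dt.month` is acceptable; the variable names `dates` and `out` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'state': ['NY', 'CA', 'NY', 'NY', 'CA', 'CA', 'WY'],
    'date': ['2018-01-02', '2018-01-04', '2018-02-06',
             '2018-01-01', '2018-01-24', '2018-02-10', '2018-01-24'],
})

dates = pd.to_datetime(df['date'])
# Year followed by the unpadded month (e.g. "20181"), without %#m or %-m
df['date_ym'] = dates.dt.year.astype(str) + dates.dt.month.astype(str)

def func(x):
    cnt = len(x)
    cnt_A = sum(x == 'A')
    return pd.Series({
        'total_count': cnt,
        'total_type_A': cnt_A,
        'percentage': cnt_A / cnt * 100,
    })

out = df.groupby(['date_ym', 'state'])['type'].apply(func).unstack().reset_index()
# Cast the two count columns back to integers after unstacking
out = out.astype({'total_count': int, 'total_type_A': int})
print(out)
```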