I have a DataFrame, df:
Date inp name
0 2017-08-07 2.3.6 ABC
1 2017-08-07 2.3.6 ABC
2 2017-08-08 2.3.6 TAC
3 2017-08-22 2.5.9 TTT
4 2017-09-23 0.8.0 TAC
5 2017-10-09 2.3.6 ABC
6 2017-10-09 2.3.6 TAC
7 2017-10-09 2.3.6 TAC
8 2017-10-23 0.8.0 TAC
9 2017-11-08 6.2.6 ABC
I then want to count the occurrences of each ('inp', 'name') pair, grouped by month. The resulting DataFrame df2 should look like this:
Date inp name count
2017-08 2.3.6 ABC 2
2017-08 2.3.6 TAC 1
2017-08 2.5.9 TTT 1
2017-09 0.8.0 TAC 1
2017-10 2.3.6 ABC 1
2017-10 2.3.6 TAC 2
2017-10 0.8.0 TAC 1
2017-11 6.2.6 ABC 1
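Then I want a new DataFrame, df3, as shown below. It is built by counting the occurrences of each (inp, name) pair per month, changing the date index to month names, and then pivoting: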
Index 2.3.6ABC 2.3.6TAC 2.5.9TTT 0.8.0TAC 6.2.6ABC
August 2 1 1 0 0
September 0 0 0 1 0
October 1 2 0 1 0
November 0 0 0 0 1
This is the code I have:
df=pd.DataFrame(df, columns= ['Date','inp','name'])
df['Date']= pd.to_datetime(df['Date'], format= '"%m/%d/%Y %H:%M:%S 0"')
df = df.set_index(['Date'])
print(df)
df = df.loc['2017-08-01':'2017-11-30']
df2 = (df.groupby(df.index.date,'inp')['name']
.value_counts()
.rename_axis(('Date','inp','name'))
.reset_index(name='count'))
print (df2)
#Sum the total number of unique (name,inp) associated per month
df2.Date= pd.to_datetime(df2.Date)
df3 = df2.groupby( [pd.Grouper(key='Date', freq='1M'),'inp','name']) ["count"].sum().unstack().fillna(0)
df3.index = df3.index.strftime('%B')
print(df3)
But I keep getting:
ValueError: No axis named inp for object type <class 'pandas.core.frame.DataFrame'>
In addition, I want to drop the columns that contain more than 2 zeros, giving a new DataFrame like the one below. How can I do that?
Index 2.3.6ABC 2.3.6TAC 0.8.0TAC
August 2 1 0
September 0 0 1
October 1 2 1
November 0 0 0
Answer 0 (score: 1)
I think you can pass a list ([]) of keys to groupby and, for a faster solution, use floor instead of df['Date'].dt.date:
df2 = (df.groupby([df['Date'].dt.floor('D'),'inp'])['name']
.value_counts()
.rename_axis(('Date','inp','name'))
.reset_index(name='count'))
print (df2)
Date inp name count
0 2017-08-07 2.3.6 ABC 2
1 2017-08-08 2.3.6 TAC 1
2 2017-08-22 2.5.9 TTT 1
3 2017-09-23 0.8.0 TAC 1
4 2017-10-09 2.3.6 TAC 2
5 2017-10-09 2.3.6 ABC 1
6 2017-10-23 0.8.0 TAC 1
7 2017-11-08 6.2.6 ABC 1
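The error in the question most likely comes from `df.groupby(df.index.date,'inp')`: the second positional argument is not treated as a grouping key (in the pandas versions that raise this message it is read as `axis`), so both keys have to go into one list. The output above is still per day, while the df2 requested in the question is per month. A minimal sketch of the monthly variant, assuming `Date` is a regular column rather than the index; the names `month` and `df2_monthly` are introduced here for illustration only:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date": ["2017-08-07", "2017-08-07", "2017-08-08", "2017-08-22", "2017-09-23",
             "2017-10-09", "2017-10-09", "2017-10-09", "2017-10-23", "2017-11-08"],
    "inp": ["2.3.6", "2.3.6", "2.3.6", "2.5.9", "0.8.0",
            "2.3.6", "2.3.6", "2.3.6", "0.8.0", "6.2.6"],
    "name": ["ABC", "ABC", "TAC", "TTT", "TAC", "ABC", "TAC", "TAC", "TAC", "ABC"],
})
df["Date"] = pd.to_datetime(df["Date"])

# Monthly key (hypothetical helper name): a Period such as 2017-08 for each row.
month = df["Date"].dt.to_period("M").rename("Date")

# All grouping keys go into ONE list; size() counts the rows in each group.
df2_monthly = (
    df.groupby([month, "inp", "name"])
      .size()
      .reset_index(name="count")
)
print(df2_monthly)
```

Rows come back sorted by the group keys, so the order within a month may differ from the table in the question.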
Then unstack by the second and third levels, replacing NaN with 0 via the fill_value parameter, .unstack(level=[1,2], fill_value=0):
df3 = (df2.groupby([pd.Grouper(key='Date', freq='1M'),'inp','name'])["count"]
.sum()
.unstack(level=[1,2], fill_value=0))
df3.columns = df3.columns.map(''.join)
df3.index = df3.index.strftime('%B')
print (df3)
2.3.6ABC 2.3.6TAC 2.5.9TTT 0.8.0TAC 6.2.6ABC
August 2 1 1 0 0
September 0 0 0 1 0
October 1 2 0 1 0
November 0 0 0 0 1
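If the intermediate df2 is not needed, the same table can probably be built in one pass from the raw data. The sketch below is an alternative, not the answer's method: it groups on a monthly Period key instead of `pd.Grouper(freq='1M')` (recent pandas releases warn about the `'M'` offset alias there) and joins `inp` and `name` before counting. The names `label` and `df3_alt` are assumptions made for this example.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date": ["2017-08-07", "2017-08-07", "2017-08-08", "2017-08-22", "2017-09-23",
             "2017-10-09", "2017-10-09", "2017-10-09", "2017-10-23", "2017-11-08"],
    "inp": ["2.3.6", "2.3.6", "2.3.6", "2.5.9", "0.8.0",
            "2.3.6", "2.3.6", "2.3.6", "0.8.0", "6.2.6"],
    "name": ["ABC", "ABC", "TAC", "TTT", "TAC", "ABC", "TAC", "TAC", "TAC", "ABC"],
})
df["Date"] = pd.to_datetime(df["Date"])

month = df["Date"].dt.to_period("M")              # one Period per calendar month
label = (df["inp"] + df["name"]).rename("label")  # e.g. "2.3.6ABC"

# Count rows per (month, label), then move the labels into columns.
df3_alt = (
    df.groupby([month, label])
      .size()
      .unstack(fill_value=0)
)
df3_alt.index = df3_alt.index.strftime("%B")  # month names (locale-dependent)
df3_alt.columns.name = None
print(df3_alt)
```

Because the keys are sorted, the columns appear in alphabetical order of the label rather than in order of first appearance; reindex the columns if a specific order matters.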
Finally, drop the columns using boolean indexing with loc:
df4 = df3.loc[:, df3.eq(0).sum() <= 2]
#same as
#df4 = df3.loc[:, (df3 == 0).sum() <= 2]
print (df4)
2.3.6ABC 2.3.6TAC 0.8.0TAC
August 2 1 0
September 0 0 1
October 1 2 1
November 0 0 0
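The filter keeps a column when its number of zeros is at most 2, which is the same as dropping columns with more than 2 zeros. A small sketch that wraps the same expression in a reusable function; `drop_sparse_columns` and `max_zeros` are hypothetical names, and the frame is typed in by hand from the df3 output above:

```python
import pandas as pd

def drop_sparse_columns(frame: pd.DataFrame, max_zeros: int = 2) -> pd.DataFrame:
    """Keep only the columns that contain at most `max_zeros` zero entries."""
    zero_counts = frame.eq(0).sum(axis=0)  # number of zeros in each column
    return frame.loc[:, zero_counts <= max_zeros]

# df3 as printed in the answer.
df3 = pd.DataFrame(
    {
        "2.3.6ABC": [2, 0, 1, 0],
        "2.3.6TAC": [1, 0, 2, 0],
        "2.5.9TTT": [1, 0, 0, 0],
        "0.8.0TAC": [0, 1, 1, 0],
        "6.2.6ABC": [0, 0, 0, 1],
    },
    index=["August", "September", "October", "November"],
)

df4 = drop_sparse_columns(df3)
print(list(df4.columns))  # ['2.3.6ABC', '2.3.6TAC', '0.8.0TAC']
```

Making the threshold a parameter keeps the comparison in one place if the cutoff changes later.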