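Suppose I have a DataFrame with the following columns: date, time, day, month, year, description, price, type, manufacturer.
Using pandas and value_counts(), I can get the count of each unique item in a column: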
df.manufacturer.value_counts()
Also, using groupby I can get the average price for each day in the data:
df.groupby("day").price.mean()
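The problem is that there are 7 days in total, but my data may contain only 5 or 6 of them, so I need to add the missing days with a mean of zero or None.
In general, if I have a specific list of expected items, how do I include the missing ones when I run value_counts or groupby?
Answer 0 (score: 0)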
I think you can convert the day column to a categorical, so that groupby + mean returns NaN for the missing categories:
import pandas as pd

df = pd.DataFrame({
    'day': ['Monday', 'Tuesday', 'Tuesday', 'Tuesday', 'Thursday'],
    'price': list(range(5))
})
print(df)
day price
0 Monday 0
1 Tuesday 1
2 Tuesday 2
3 Tuesday 3
4 Thursday 4
cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['day'] = pd.Categorical(df['day'], categories=cats, ordered=True)
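# observed=False keeps the unused categories; newer pandas versions change the default
print(df.groupby("day", as_index=False, observed=False).price.mean())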
day price
0 Monday 0.0
1 Tuesday 2.0
2 Wednesday NaN
3 Thursday 4.0
4 Friday NaN
5 Saturday NaN
6 Sunday NaN
Another solution is to reindex by all possible categories:
cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
print(df.groupby("day").price.mean().reindex(cats))
day
Monday 0.0
Tuesday 2.0
Wednesday NaN
Thursday 4.0
Friday NaN
Saturday NaN
Sunday NaN
Name: price, dtype: float64
print(df.groupby("day").price.mean().reindex(cats, fill_value=0))
day
Monday 0.0
Tuesday 2.0
Wednesday 0.0
Thursday 4.0
Friday 0.0
Saturday 0.0
Sunday 0.0
Name: price, dtype: float64
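The question also asks about value_counts, which the examples above do not cover. The same reindex idea should carry over; the following is a minimal sketch, assuming the day column holds plain strings (not yet converted to categorical) and reusing the sample data from this answer:

```python
import pandas as pd

cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Same sample data as above; 'day' is a plain string column here.
df = pd.DataFrame({
    'day': ['Monday', 'Tuesday', 'Tuesday', 'Tuesday', 'Thursday'],
    'price': list(range(5))
})

# Count the days that occur, then reindex so every expected day appears;
# days that never occur get a count of 0.
counts = df['day'].value_counts().reindex(cats, fill_value=0)
print(counts)
```

Note that fill_value only applies to labels that reindex introduces; it does not replace NaN values that are already present in the result.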
Answer 1 (score: 0)
You should use Categorical Data. Here is a minimal example.
import pandas as pd
df = pd.DataFrame([['Monday', 5], ['Monday', 6], ['Monday', 3],
['Tuesday', 1], ['Tuesday', 8]],
columns=['day', 'value'])
# list days in order
day_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# convert to categorical
df['day'] = df['day'].astype('category')
# set categories and ordered=True
df['day'] = df['day'].cat.set_categories(day_list, ordered=True)
# perform groupby and fillna with 0
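# observed=False keeps the unused categories; newer pandas versions change the default
res = df.groupby('day', observed=False).mean().fillna(0)
print(res)
Result: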
value
day
Monday 4.666667
Tuesday 4.500000
Wednesday 0.000000
Thursday 0.000000
Friday 0.000000
Saturday 0.000000
Sunday 0.000000
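For the value_counts half of the question, a categorical column should behave similarly: pandas reports every declared category, including the unused ones. A minimal sketch under that assumption, with the same sample data as above (the names counts and means are only illustrative):

```python
import pandas as pd

day_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

df = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Monday', 'Tuesday', 'Tuesday'],
    'value': [5, 6, 3, 1, 8]
})
df['day'] = pd.Categorical(df['day'], categories=day_list, ordered=True)

# value_counts on a categorical column lists all categories, with 0 for
# the ones that never occur; sort_index restores the weekday order.
counts = df['day'].value_counts().sort_index()

# Passing observed=False explicitly asks groupby to keep unused categories,
# so the result does not depend on the default of the installed pandas version.
means = df.groupby('day', observed=False)['value'].mean()

print(counts)
print(means)
```

If zeros are preferred over NaN for the means, fillna(0) can be chained as in the example above.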