A simple example DataFrame:
import pandas as pd

df = pd.DataFrame({
    'year': [1900, 1901, 1901, 1901, 1902, 1903, 1903, 1903, 1905]
})
I have the following function, which takes a pandas DataFrame:
def my_function(df):
    df = df.groupby(['year'])  # group the df by year
    new_df = pd.DataFrame()  # make a new empty df
    new_df['frequency'] = df['year'].count()  # get frequency counts for each year
    return new_df
However, this output does not give me a frequency count of 0 for the missing years.
Ideal output of my_function(df):
year frequency
1900 1
1901 3
1902 1
1903 3
1904 0
1905 1
Current output of my_function(df):
1900 1
1901 3
1902 1
1903 3
1905 1
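I think I'm close with pd.reindex(), but I need some guidance.
I have scanned the docs for pd.reindex() and tried looking at this stackoverflow post and this one, but I still can't figure it out.
I defined the ideal range of years in a new variable: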
new_idx = range(1900, 1905)
Then I tried to apply pd.reindex() as follows:
new_df.reindex(new_idx, fill_value=0)
This changed the function slightly, so it now looks like this:
def my_function(df):
    new_idx = range(1900, 1905)
    df = df.groupby(['year'])
    new_df = pd.DataFrame()
    new_df['frequency'] = df['year'].count()
    new_df = new_df.reindex(new_idx, fill_value=0)
    return new_df
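However, this produces a new pd.DataFrame() that is the size I want (the number of years in new_idx), but it overwrites all the frequency values with 0, not just those of the "added" years.
Ideal output of the slightly tweaked my_function(df):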
year frequency
1900 1
1901 3
1902 1
1903 3
1904 0
1905 1
Current output of the slightly tweaked my_function(df):
year frequency
1900 0
1901 0
1902 0
1903 0
1904 0
1905 0
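One possible explanation for the all-zero result, offered only as an assumption because the real data is not shown, is that the year labels in the real index have a different dtype (for example strings) than the integers in new_idx. The sketch below uses a hypothetical string-typed variant of the sample data (`df_str`) to show that reindex then matches nothing and fills every row with fill_value, and that casting the index first lets the labels line up.

```python
import pandas as pd

# Hypothetical variant of the sample data: years stored as strings, not integers
df_str = pd.DataFrame({'year': ['1900', '1901', '1901', '1903']})
counts = df_str['year'].value_counts()

# Integer labels never match the string index, so every row gets fill_value
mismatched = counts.reindex(range(1900, 1904), fill_value=0)
print(mismatched)

# Casting the index to int first lets the existing labels line up
counts.index = counts.index.astype(int)
aligned = counts.reindex(range(1900, 1904), fill_value=0)
print(aligned)
```

If the dtypes already agree, this is not the cause and the problem lies elsewhere.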
Answer 0 (score: 1)
You want Series.value_counts + Series.reindex:
new_df = (
    df['year'].value_counts()
    .reindex(range(df['year'].min(), df['year'].max() + 1), fill_value=0)
    .rename_axis('year')
    .reset_index(name='frequency')
)
print(new_df)
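As a minimal sketch, the same value_counts + reindex chain can be wrapped in a function shaped like the one in the question. The function name `count_years` and its `column` parameter are illustrative choices, not part of the original answer. Note that range() excludes its end point, which is why the maximum year needs the +1.

```python
import pandas as pd

def count_years(df, column='year'):
    """Count rows per year, including years with no rows (count 0)."""
    years = df[column]
    # range() excludes its end point, hence the +1 to keep the last year
    full_range = range(years.min(), years.max() + 1)
    return (
        years.value_counts()
        .reindex(full_range, fill_value=0)
        .rename_axis(column)
        .reset_index(name='frequency')
    )

df = pd.DataFrame({
    'year': [1900, 1901, 1901, 1901, 1902, 1903, 1903, 1903, 1905]
})
result = count_years(df)
print(result)
```

Because reindex is given an integer fill_value, the counts stay integers rather than becoming floats.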
Answer 1 (score: 0)
You can use pd.date_range + pd.Series.value_counts:
import pandas as pd

df = pd.DataFrame({
    'year': [1900, 1901, 1901, 1901, 1902, 1903, 1903, 1903, 1905]
})

# generate a date range between the minimum and maximum year, one entry per year start
# ('YS' is the year-start alias; older pandas versions also accept 'AS')
start = pd.Timestamp(year=int(df['year'].min()), month=1, day=1)
end = pd.Timestamp(year=int(df['year'].max()), month=1, day=1)
years = pd.date_range(start=start, end=end, freq='YS')

# map each year to its count, setting missing years to zero
result = pd.Series(years.year, index=years.year).map(df['year'].value_counts()).fillna(0)
print(result)
Output:
1900 1.0
1901 3.0
1902 1.0
1903 3.0
1904 0.0
1905 1.0
dtype: float64
The idea is to generate a yearly date range between the first and last year, and to map the year frequencies onto that range.
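The counts come out as floats because map leaves NaN for years with no rows, and NaN forces a float dtype before fillna runs. A minimal sketch of getting integers back follows. It assumes the years are plain integers and swaps the date range for a simple pd.RangeIndex, which is a simplification rather than the approach shown in this answer.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [1900, 1901, 1901, 1901, 1902, 1903, 1903, 1903, 1905]
})

# Every year from the first to the last, inclusive (the stop value is exclusive)
full = pd.RangeIndex(int(df['year'].min()), int(df['year'].max()) + 1, name='year')

# Look up each year's count; years with no rows become NaN, so the dtype is float
mapped = pd.Series(full, index=full).map(df['year'].value_counts())

# Replace NaN with 0, then cast back to integers
result = mapped.fillna(0).astype(int)
print(result)
```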