Question

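I'm looking for a way to append empty rows to a DataFrame when certain conditions are met. The condition is: if an index ID is not found for a particular year, the code should add an empty row that has the index "ID" and the year, with all other columns empty. The end goal is a perfectly balanced panel dataset in which every observation appears 7 times (once per year), even though some observations currently have data for only 1 or 3 years, for example (the count is not constant; it varies from one observation to the next). Apart from the index "ID" and the year, these missing-data rows would be empty.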

Here is an example of what my DataFrame all_data currently looks like:

ID      Year      Data1      Data2
345     2010        3          2
345     2011        1          4
345     2012        5          2
345     2013        3          1
345     2014        3          1
345     2015        3          1
345     2016        3          1
123     2010        1          1
123     2012        0          2
123     2016        0          2

And here is an example of what I'm looking for:

ID      Year      Data1      Data2
345     2010        3          2
345     2011        1          4
345     2012        5          2
345     2013        3          1
345     2014        3          1
345     2015        3          1
345     2016        3          1
123     2010        1          1
123     2011                  
123     2012        0          2
123     2013
123     2014
123     2015
123     2016        0          2

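I have more than 200 observations and 20 "Data" columns, so doing this by hand would take far too long. Here is what I tried, but it didn't work: it returns the same DataFrame and adds no empty rows. "missing" is a list containing every unique ID that can be found in the all_data DataFrame.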

missing = ['345', '123']
sub_dfs = []
for year in [ 2010, 2011, 2012, 2013, 2014, 2015, 2016 ]:
    sub_df = all_data.loc[ all_data[ 'Year' ] == year ].copy()
    if( year == 2010):
        sub_df.set_index( 'ID', inplace=True)
        sub_df.reindex(sub_df.index.union(missing))
    if (year == 2011):
        sub_df.set_index('ID', inplace=True)
        sub_df.reindex(sub_df.index.union(missing))
    if (year == 2012):
        sub_df.set_index('ID', inplace=True)
        sub_df.reindex(sub_df.index.union(missing))
    if (year == 2013):
        sub_df.set_index('ID', inplace=True)
        sub_df.reindex(sub_df.index.union(missing))
    if (year == 2014):
        sub_df.set_index('ID', inplace=True)
        sub_df.reindex(sub_df.index.union(missing))
    if (year == 2015):
        sub_df.set_index('ID', inplace=True)
        sub_df.reindex(sub_df.index.union(missing))
    if (year == 2016):
        sub_df.set_index('ID', inplace=True)
        sub_df.reindex(sub_df.index.union(missing))
    sub_dfs.append(sub_df)

new_data = pd.concat(sub_dfs)

Thanks in advance for your help!

Answer 1

Use reindex with a MultiIndex created by MultiIndex.from_product from all unique values of ID combined with np.arange over the minimum and maximum year.
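
The code that originally accompanied this answer is not preserved here, so the following is only a minimal sketch of the approach described, not the answerer's exact code. It rebuilds the sample all_data from the question; the names years, full_index and new_data are illustrative, and it assumes ID and Year are integer columns and that each (ID, Year) pair occurs at most once (reindex raises on a duplicated index).

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
all_data = pd.DataFrame({
    "ID":    [345] * 7 + [123] * 3,
    "Year":  [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2010, 2012, 2016],
    "Data1": [3, 1, 5, 3, 3, 3, 3, 1, 0, 0],
    "Data2": [2, 4, 2, 1, 1, 1, 1, 1, 2, 2],
})

# Every year between the smallest and largest year in the data (inclusive).
years = np.arange(all_data["Year"].min(), all_data["Year"].max() + 1)

# Cartesian product of all IDs and all years: the index of a balanced panel.
full_index = pd.MultiIndex.from_product(
    [all_data["ID"].unique(), years], names=["ID", "Year"]
)

# Reindexing inserts the missing (ID, Year) pairs; their data columns are NaN.
new_data = (
    all_data.set_index(["ID", "Year"])
            .reindex(full_index)
            .reset_index()
)
print(new_data)
```

The added rows hold NaN rather than empty strings, which keeps the data columns numeric (they become float). If blanks are only needed for display or export, something like new_data.to_csv(..., na_rep="") is one option.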

Answer 2

jezrael is always faster, but I'm here to learn pandas, so here is my attempt ;)

I'm using the resample method: you want to fill the gaps by remapping the data onto annual-start ('AS') intervals.

First, convert the "Year" column to pandas datetime and set it as the index:

df.Year = pd.to_datetime(df.Year, format="%Y")
df = df.set_index('Year')

Then I process each unique ID separately and create a new output DataFrame:

IDs = df.ID.unique()
newDf = pd.DataFrame()

The processing loop:

for ID in IDs:
    # resample to annual start (although end would also be OK)
    temp = df[df.ID==ID].resample('AS').sum()
    # fill in the blanks, now 0, with the wanted data
    temp[temp.ID==0] = pd.DataFrame({'ID':ID, 'Data1':'', 'Data2':''},
        index=temp[temp.ID==0].index)
    # concat this new data with the output frame
    newDf = pd.concat([newDf, temp])

Finally, clean up the output by resetting the index and converting the datetimes back to strings:

newDf = newDf.reset_index()
newDf.Year = newDf.Year.dt.strftime('%Y')

Result:

    Year   ID Data1 Data2
0   2010  345     3     2
1   2011  345     1     4
2   2012  345     5     2
3   2013  345     3     1
4   2014  345     3     1
5   2015  345     3     1
6   2016  345     3     1
7   2010  123     1     1
8   2011  123            
9   2012  123     0     2
10  2013  123            
11  2014  123            
12  2015  123            
13  2016  123     0     2
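
For readers on a recent pandas, here is a hedged variant of the same resample idea. It assumes pandas 2.2 or later deprecates the 'AS' alias in favour of 'YS', which is why 'YS' is used below. It also swaps sum() for asfreq(), so the gaps come out as NaN instead of being detected through ID == 0. The names pieces and out are illustrative, and the sample data is rebuilt from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID":    [345] * 7 + [123] * 3,
    "Year":  [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2010, 2012, 2016],
    "Data1": [3, 1, 5, 3, 3, 3, 3, 1, 0, 0],
    "Data2": [2, 4, 2, 1, 1, 1, 1, 1, 2, 2],
})
df["Year"] = pd.to_datetime(df["Year"], format="%Y")

pieces = []
for id_, grp in df.groupby("ID", sort=False):
    # One row per year start between this ID's first and last year;
    # years without data become all-NaN rows.
    temp = grp.set_index("Year").resample("YS").asfreq()
    temp["ID"] = id_  # restore the ID on the newly created rows
    pieces.append(temp)

out = pd.concat(pieces).reset_index()
out["Year"] = out["Year"].dt.year  # back to plain integer years
print(out)
```

One caveat applies to any per-ID resample: it only fills gaps between each ID's own first and last year. An ID whose data starts after 2010 or ends before 2016 would not get rows for the years outside that range.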

Pandas: create perfect panel data, append empty rows based on a condition
