Question

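I have a dataset showing annual growth indicators for more than 100 countries from 1970 to 2013. Not every country has data for every year; the country with the fewest years has 30 years of data. I want to even things out so that every country shows exactly 30 years of data, by removing years from the countries that have more than 30. An example is below.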

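I thought about using loops to drop rows from the dataframe until every country appears 30 times, and then building a brand-new dataframe, but I would like to believe there is a better solution.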

import pandas as pd

data = {'Country':['Israel','Congo','Denmark',
                   'Israel','Denmark',
                   'Israel','Congo',
                   'Israel','Congo','Denmark'],
        'Year':[2000,2000,2000,
                2001,2001,
                2002,2002,
                2003,2003,2003],
        'Value':[2.5,1.2,3.1,2.8,1.1,2.9,3.1,1.9,3.0,3.1]}
df = pd.DataFrame(data=data)
df
   Country  Year  Value
0   Israel  2000    2.5
1    Congo  2000    1.2
2  Denmark  2000    3.1
3   Israel  2001    2.8
4  Denmark  2001    1.1
5   Israel  2002    2.9
6    Congo  2002    3.1
7   Israel  2003    1.9
8    Congo  2003    3.0
9  Denmark  2003    3.1

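The code above creates a dataframe with an example of just 3 countries and 4 years. You can see that Israel has 4 years of data, while Denmark and Congo have only 3. I want to remove one year from Israel so that all countries have 3 years. In the real dataframe, I want to remove years from the countries that have more than 30, so that all countries have the same number of years, preferably removing the years with the smallest values.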

Here is my solution using for loops, which takes quite a few lines of code:

gp = df.groupby('Country').groups  # Map each country name to its row labels.
d = {}  # Country name => list of row labels to keep.

# Collect at most 3 row labels per country.
for i in gp:
    d[i] = []
    for j in gp[i]:
        # Each country appears once per year, so 3 labels means 3 years.
        # If a country appears more than 3 times, only the labels of its
        # first 3 occurrences are kept.
        if len(d[i]) < 3:
            d[i].append(j)

indices = []  # Row labels to keep in the dataframe.
for i in d:
    for j in d[i]:
        if len(d[i]) == 3:  # Keep only countries with exactly 3 labels.
            indices.append(j)

final_df = df.loc[indices, ['Country', 'Year', 'Value']]
final_df
# Israel now has one fewer row, so all countries have 3 values.
   Country  Year  Value
1    Congo  2000    1.2
6    Congo  2002    3.1
8    Congo  2003    3.0
2  Denmark  2000    3.1
4  Denmark  2001    1.1
9  Denmark  2003    3.1
0   Israel  2000    2.5
3   Israel  2001    2.8
5   Israel  2002    2.9

Answer 1

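You can build a list of the most recent years from the unique values in the Year column, and use boolean indexing to select the rows whose year is in that list.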

recent_years = df.Year.unique()[-3:]
df[df.Year.isin(recent_years)]

    Country Year    Value
3   Israel  2001    2.8
4   Denmark 2001    1.1
5   Israel  2002    2.9
6   Congo   2002    3.1
7   Israel  2003    1.9
8   Congo   2003    3.0
9   Denmark 2003    3.1
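Note that in the output above Israel has three rows while Congo and Denmark have two each, because the filter works on calendar years rather than on rows per country. A minimal sketch for checking how many rows per country a filter leaves, using the example data from the question (the names `subset` and `counts` are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Israel', 'Congo', 'Denmark', 'Israel', 'Denmark',
                'Israel', 'Congo', 'Israel', 'Congo', 'Denmark'],
    'Year': [2000, 2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2003],
    'Value': [2.5, 1.2, 3.1, 2.8, 1.1, 2.9, 3.1, 1.9, 3.0, 3.1],
})

# Same filter as above: keep rows from the last 3 distinct years.
recent_years = df['Year'].unique()[-3:]
subset = df[df['Year'].isin(recent_years)]

# Count the remaining rows per country.
counts = subset['Country'].value_counts()
print(counts)

# True only if every country ended up with the same number of rows.
balanced = counts.nunique() == 1
print(balanced)
```

If `balanced` comes out False, the year-based filter alone is not enough for the "same number of years per country" requirement.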

If your Year values are not necessarily in order, use NumPy's unique, which returns a sorted array, unlike pandas' unique():

import numpy as np

recent_years = np.unique(df.Year)[-3:]
df[df.Year.isin(recent_years)]
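To see why the choice matters, here is a small sketch on a made-up, unordered list of years (the `years` Series is hypothetical and not part of the question's data). pandas' `unique()` keeps the order of first appearance, while `np.unique` sorts:

```python
import numpy as np
import pandas as pd

# Hypothetical unordered years, for illustration only.
years = pd.Series([2003, 2000, 2002, 2001, 2003, 2000])

# pandas keeps the order in which values first appear.
recent_unsorted = years.unique()[-3:]

# NumPy returns the distinct values sorted in ascending order.
recent_sorted = np.unique(years)[-3:]

print(list(recent_unsorted))
print(list(recent_sorted))
```

With unordered data, slicing the pandas result with `[-3:]` picks the last three years to appear, not the three most recent ones.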

Here is another solution, which returns the 3 most recent years for each country. If the data is not sorted by year, it needs to be sorted first.

idx = df.groupby('Country').apply(lambda x: x['Year'].tail(3)).index
# Use the columns= keyword: the positional axis argument was removed in pandas 2.0.
df.set_index(['Country', df.index]).reindex(idx).reset_index().drop(columns='level_1')

    Country Year    Value
0   Congo   2000    1.2
1   Congo   2002    3.1
2   Congo   2003    3.0
3   Denmark 2000    3.1
4   Denmark 2001    1.1
5   Denmark 2003    3.1
6   Israel  2001    2.8
7   Israel  2002    2.9
8   Israel  2003    1.9

If the data is not sorted, sort it first with:

df = df.sort_values(by = 'Year')
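A possibly simpler route to the same per-country result is to call `tail` directly on the groupby object, which avoids the reindexing step. This is only a sketch on the question's example data; `n_years` and `latest` are names introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Israel', 'Congo', 'Denmark', 'Israel', 'Denmark',
                'Israel', 'Congo', 'Israel', 'Congo', 'Denmark'],
    'Year': [2000, 2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2003],
    'Value': [2.5, 1.2, 3.1, 2.8, 1.1, 2.9, 3.1, 1.9, 3.0, 3.1],
})

n_years = 3  # number of most recent years to keep per country

latest = (
    df.sort_values('Year')            # make sure rows are in year order
      .groupby('Country')
      .tail(n_years)                  # last n_years rows of each country
      .sort_values(['Country', 'Year'])
      .reset_index(drop=True)
)
print(latest)
```

Keep in mind that a country with fewer than `n_years` rows is kept as-is with fewer rows, so this does not by itself drop countries that lack enough data.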

Answer 2

Here is my solution using pandas. It takes many lines of code, but it does what it needs to do. Thanks to @Vaishali for the help:

threshold = 3  # Countries with fewer rows than this are removed; for
               # countries with more, the extra rows with the lowest
               # values are removed.
newIndex = df.set_index('Country')  # Index by country to allow selection by label.
values = newIndex.index.value_counts()  # Count rows per country.
to_keep = values[values >= threshold].index.values  # Countries with >= threshold rows.
rank_df = newIndex.loc[to_keep, ['Value', 'Year']]  # Rows and columns to keep.

# Sort values in descending order before applying the threshold.
rank_df = rank_df.sort_values('Value', ascending=False)
# Since values are sorted, head() keeps the highest values per country.
rank_df = rank_df.groupby(rank_df.index).head(threshold)
rank_df = rank_df.groupby([rank_df.index, 'Year']).mean() \
              .sort_values('Value', ascending=False)

# Finally, reset the index to turn Year back into a column, and sort by year.
rank_df.reset_index(level=1).sort_values('Year')

Output:

            Year    Value
Country         
Denmark     2000    3.1
Israel      2000    2.5
Congo       2000    1.2
Israel      2001    2.8
Denmark     2001    1.1
Congo       2002    3.1
Israel      2002    2.9
Denmark     2003    3.1
Congo       2003    3.0
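The same idea (drop countries with too few rows, then keep the highest values per country) can probably be written more compactly. The following is a sketch under the assumption that each country appears at most once per year; `sizes`, `eligible` and `top` are names introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Israel', 'Congo', 'Denmark', 'Israel', 'Denmark',
                'Israel', 'Congo', 'Israel', 'Congo', 'Denmark'],
    'Year': [2000, 2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2003],
    'Value': [2.5, 1.2, 3.1, 2.8, 1.1, 2.9, 3.1, 1.9, 3.0, 3.1],
})

threshold = 3  # required number of rows (years) per country

# Number of rows for each row's country, aligned with df.
sizes = df.groupby('Country')['Year'].transform('size')

# Drop countries that have fewer than `threshold` rows.
eligible = df[sizes >= threshold]

top = (
    eligible.sort_values('Value', ascending=False)
            .groupby('Country')
            .head(threshold)          # highest `threshold` values per country
            .sort_values(['Year', 'Country'])
            .reset_index(drop=True)
)
print(top)
```

On the example data this drops Israel's 2003 row (value 1.9), the lowest of its four values. If several rows of one country tie on `Value`, which of them is dropped depends on the sort order, so passing `kind='stable'` to `sort_values` may be worth considering.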

Select a subset of a dataframe with N years of data for each variable
