Pandas: removing rows based on column values, and removing the same series for all other groups

Date: 2016-02-21 03:32:43

Tags: python pandas

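My problem is best explained visually. There are many answers on how to remove rows with certain column values from a Pandas DataFrame, but I'm not sure of the best way to handle the additional steps I want to perform.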

country series 2005 2006 2007 2008 2009

AFG     net m  ..   ..    5    ..  10
AFG     battle 100  50   55   60   100
AFG     GDP    200  100  150  200  250
AFG     info   ..   ..   ..   ..   ..  
AFG     life   60   ..    61  63   64
AFG     unemp  5.7  5.9  6.0  5.4  5.3
ALB     net m  ..   ..    5    ..  10
ALB     battle 100  50   55   60   100
ALB     GDP    200  100  150  200  250
ALB     info   ..   45   ..   99   ..  
ALB     life   78   ..    61  63   64
ALB     unemp  ..   ..   ..   ..   3.2
and so on for other countries

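For each country, I want to check that every series listed has at least 2 values across the year columns. If a series has fewer than 2 values, that row should be removed. However, if a series is removed for one country, it should be removed for all other countries as well, even if the condition does not apply to them.

My output would be: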

country series 2005 2006 2007 2008 2009

AFG     net m  ..   ..    5    ..  10
AFG     battle 100  50   55   60   100
AFG     GDP    200  100  150  200  250
AFG     life   60   ..    61  63   64
ALB     net m  ..   ..    5    ..  10
ALB     battle 100  50   55   60   100
ALB     GDP    200  100  150  200  250
ALB     life   78   ..    61  63   64

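Here info was removed for AFG because it has no values in the year columns, but it was also removed for ALB and all other countries. unemp was removed because only one value exists for ALB, but it was removed for all other countries too.

Thanks for your time and any feedback.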

3 Answers:

Answer 0: (score: 1)

import pandas as pd

df1 = pd.DataFrame([["AFG", "net m", "", "", 5, "", 10],
                    ["AFG", "battle", 100, 50, 55, 60, 100],
                    ["AFG", "GDP", 200, 100, 150, 200, 250],
                    ["AFG", "info", "", "", "", "", ""],
                    ["AFG", "life", 60, "", 61, 63, 64],
                    ["ALB", "net m", "", "", 5, "", 10],
                    ["ALB", "battle", 100, 50, 55, 60, 100],
                    ["ALB", "GDP", 200, 100, 150, 200, 250],
                    ["ALB", "info", "", 45, "", 99, ""],
                    ["ALB", "life", 78, "", 61, 63, 64],
                   ], columns=["country", "series", 2005, 2006, 2007, 2008, 2009])

list_of_series_to_exclude = []
for i in df1["country"].unique():  # loop over unique countries
    # loop over the slice of the original dataframe for the current country
    for _, row in df1[df1["country"] == i].iterrows():
        series = row.iloc[1]  # name of the current series
        years = row.iloc[2:]  # year columns
        blanks = (years == "").sum()  # number of blank year cells in this row
        # fewer than 2 values present means more than len(years) - 2 blanks
        if blanks > len(years) - 2:
            list_of_series_to_exclude.append(series)

# keep only the series that were never flagged for any country
final_set = set(df1["series"])
set_to_sub = set(list_of_series_to_exclude)
final_list = list(final_set - set_to_sub)
df1 = df1[df1["series"].isin(final_list)]

Output:

print(df1)
 country  series 2005 2006 2007 2008 2009
0     AFG   net m              5        10
1     AFG  battle  100   50   55   60  100
2     AFG     GDP  200  100  150  200  250
4     AFG    life   60        61   63   64
5     ALB   net m              5        10
6     ALB  battle  100   50   55   60  100
7     ALB     GDP  200  100  150  200  250
9     ALB    life   78        61   63   64
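
The same filtering can be done without looping over rows. What follows is only a minimal sketch, assuming blanks are stored as empty strings as in this answer; the names `year_cols`, `n_values` and `sparse_series` are illustrative and the sample is cut down to two series.

```python
import pandas as pd

# Reduced sample: blanks are empty strings, as in the answer above
df1 = pd.DataFrame(
    [["AFG", "info", "", "", "", "", ""],
     ["AFG", "life", 60, "", 61, 63, 64],
     ["ALB", "info", "", 45, "", 99, ""],
     ["ALB", "life", 78, "", 61, 63, 64]],
    columns=["country", "series", 2005, 2006, 2007, 2008, 2009])

year_cols = [2005, 2006, 2007, 2008, 2009]

# Number of non-blank year cells in each row
n_values = (df1[year_cols] != "").sum(axis=1)

# Series that have fewer than 2 values for at least one country
sparse_series = df1.loc[n_values < 2, "series"].unique()

# Drop those series for every country
result = df1[~df1["series"].isin(sparse_series)]
print(result)
```

Comparing the whole block of year columns at once replaces both the per-country loop and the per-row counting.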

Answer 1: (score: 1)

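I'm assuming the empty fields are represented as NaN. You can first use isnull and sum to find the rows with fewer than two valid values, then filter the original DataFrame by the associated "series" values using isin: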

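# True for rows with fewer than 2 non-null values in the year columns
mask = (~df[list(range(2005, 2010))].isnull()).sum(axis=1) < 2
# drop every series that is flagged for at least one country
print(df[~df["series"].isin(df[mask]["series"])])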

Output:

  country  series  2005  2006  2007  2008  2009
0     AFG   net m   NaN   NaN     5   NaN    10
1     AFG  battle   100    50    55    60   100
2     AFG     GDP   200   100   150   200   250
4     AFG    life    60   NaN    61    63    64
5     ALB   net m   NaN   NaN     5   NaN    10
6     ALB  battle   100    50    55    60   100
7     ALB     GDP   200   100   150   200   250
9     ALB    life    78   NaN    61    63    64
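
The snippet above does not define `df`. As a rough, self-contained check, here is a sketch that applies the same mask to a reduced sample that includes the unemp rows from the question. It assumes the year columns are integer labels and that missing cells are NaN; the variable names `years` and `filtered` are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [["AFG", "net m", nan, nan, 5, nan, 10],
     ["AFG", "info", nan, nan, nan, nan, nan],
     ["AFG", "unemp", 5.7, 5.9, 6.0, 5.4, 5.3],
     ["ALB", "net m", nan, nan, 5, nan, 10],
     ["ALB", "info", nan, 45, nan, 99, nan],
     ["ALB", "unemp", nan, nan, nan, nan, 3.2]],
    columns=["country", "series", 2005, 2006, 2007, 2008, 2009])

years = list(range(2005, 2010))

# Rows with fewer than 2 non-null year values (AFG info, ALB unemp)
mask = (~df[years].isnull()).sum(axis=1) < 2

# Remove the flagged series for all countries, not only the flagged rows
filtered = df[~df["series"].isin(df.loc[mask, "series"])]
print(filtered)
```

Only the two net m rows survive: info is flagged by AFG and unemp by ALB, so both are dropped for every country.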

Answer 2: (score: 1)

import pandas as pd
import numpy as np
df1 = pd.DataFrame({
 '2005': ['..', '100', '200', '..', '60', '5.7', '..', '100', '200', '..', '78', '..'],
 '2006': ['..', '50', '100', '..', '..', '5.9', '..', '50', '100', '45', '..', '..'],
 '2007': ['5', '55', '150', '..', '61', '6.0', '5', '55', '150', '..', '61','..'],
 '2008': ['..', '60', '200', '..', '63', '5.4', '..', '60', '200', '99', '63', '..'],
 '2009': ['10', '100', '250', '..', '64', '5.3', '10', '100', '250', '..', '64', '3.2'],
 'country': ['AFG', 'AFG', 'AFG', 'AFG', 'AFG', 'AFG', 'ALB', 'ALB', 'ALB', 'ALB', 'ALB', 'ALB'],
 'series': ['net m', 'battle', 'GDP', 'info', 'life', 'unemp', 'net m', 'battle', 'GDP', 'info', 'life', 'unemp']},
columns=['country', 'series', '2005', '2006', '2007', '2008', '2009']).replace('..', np.nan)

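I chose to create a helper column named Count, so that the rows to be removed can be seen at a glance: they are the ones where Count is less than 2.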

df1['Count'] = df1.loc[:, '2005':].count(axis=1)
   country  series 2005 2006 2007 2008 2009  Count
0      AFG   net m  NaN  NaN    5  NaN   10      2
1      AFG  battle  100   50   55   60  100      5
2      AFG     GDP  200  100  150  200  250      5
3      AFG    info  NaN  NaN  NaN  NaN  NaN      0
4      AFG    life   60  NaN   61   63   64      4
5      AFG   unemp  5.7  5.9  6.0  5.4  5.3      5
6      ALB   net m  NaN  NaN    5  NaN   10      2
7      ALB  battle  100   50   55   60  100      5
8      ALB     GDP  200  100  150  200  250      5
9      ALB    info  NaN   45  NaN   99  NaN      2
10     ALB    life   78  NaN   61   63   64      4
11     ALB   unemp  NaN  NaN  NaN  NaN  3.2      1

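Next, check whether each row's series value is in the list of series values associated with rows where Count is less than 2. The ~ then negates the result, so those rows are excluded.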

df1[~df1['series'].isin(df1[df1['Count'] < 2]['series'].tolist())]
#Produces:
   country  series 2005 2006 2007 2008 2009  Count
0      AFG   net m  NaN  NaN    5  NaN   10      2
1      AFG  battle  100   50   55   60  100      5
2      AFG     GDP  200  100  150  200  250      5
4      AFG    life   60  NaN   61   63   64      4
6      ALB   net m  NaN  NaN    5  NaN   10      2
7      ALB  battle  100   50   55   60  100      5
8      ALB     GDP  200  100  150  200  250      5
10     ALB    life   78  NaN   61   63   64      4
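
If you would rather not keep the extra Count column, one possible variation (a different technique from the isin filter above, shown only as a sketch) is to broadcast the smallest per-series count back to every row with groupby and transform. The names `counts` and `min_per_series` are illustrative, and the sample is a reduced version of the data above.

```python
import numpy as np
import pandas as pd

# Reduced sample with '..' already replaced by NaN, values kept as strings
df1 = pd.DataFrame({
    "country": ["AFG", "AFG", "AFG", "ALB", "ALB", "ALB"],
    "series": ["net m", "info", "unemp", "net m", "info", "unemp"],
    "2005": [np.nan, np.nan, "5.7", np.nan, np.nan, np.nan],
    "2006": [np.nan, np.nan, "5.9", np.nan, "45", np.nan],
    "2007": ["5", np.nan, "6.0", "5", np.nan, np.nan],
    "2008": [np.nan, np.nan, "5.4", np.nan, "99", np.nan],
    "2009": ["10", np.nan, "5.3", "10", np.nan, "3.2"],
})

# Non-missing year values per row
counts = df1.loc[:, "2005":].count(axis=1)

# Smallest count of each series across all countries, repeated on every row
min_per_series = counts.groupby(df1["series"]).transform("min")

# Keep a series only if it has at least 2 values in every country
result = df1[min_per_series >= 2]
print(result)
```

Because the minimum is computed per series, a single sparse row in one country is enough to drop that series everywhere, which is the behaviour the question asks for.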