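My question is best explained visually. There are many answers on how to drop rows with certain column values in a Pandas DataFrame, but I am not sure of the best way to handle the additional steps I want to carry out.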
country series 2005 2006 2007 2008 2009
AFG net m .. .. 5 .. 10
AFG battle 100 50 55 60 100
AFG GDP 200 100 150 200 250
AFG info .. .. .. .. ..
AFG life 60 .. 61 63 64
AFG unemp 5.7 5.9 6.0 5.4 5.3
ALB net m .. .. 5 .. 10
ALB battle 100 50 55 60 100
ALB GDP 200 100 150 200 250
ALB info .. 45 .. 99 ..
ALB life 78 .. 61 63 64
ALB unemp .. .. .. .. 3.2
and so on for other countries
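I want to look at the series listed for each country and check that each one has at least 2 values across the year columns. If fewer than 2 values exist, the row should be dropped. However, if a series has been dropped for one country, it should also be dropped for all other countries, even where the condition does not apply to them.
My output would be: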
country series 2005 2006 2007 2008 2009
AFG net m .. .. 5 .. 10
AFG battle 100 50 55 60 100
AFG GDP 200 100 150 200 250
AFG life 60 .. 61 63 64
ALB net m .. .. 5 .. 10
ALB battle 100 50 55 60 100
ALB GDP 200 100 150 200 250
ALB life 78 .. 61 63 64
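Here info is removed for AFG because it has no values in the year columns, and it is also removed for ALB and all other countries. unemp is removed because only one value exists for ALB, and it is removed for all other countries as well.
Thank you for your time and any feedback.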
Answer 0 (score: 1)
import pandas as pd

df1 = pd.DataFrame([["AFG", "net m", "", "", 5, "", 10],
                    ["AFG", "battle", 100, 50, 55, 60, 100],
                    ["AFG", "GDP", 200, 100, 150, 200, 250],
                    ["AFG", "info", "", "", "", "", ""],
                    ["AFG", "life", 60, "", 61, 63, 64],
                    ["ALB", "net m", "", "", 5, "", 10],
                    ["ALB", "battle", 100, 50, 55, 60, 100],
                    ["ALB", "GDP", 200, 100, 150, 200, 250],
                    ["ALB", "info", "", 45, "", 99, ""],
                    ["ALB", "life", 78, "", 61, 63, 64],
                    ], columns=["country", "series", 2005, 2006, 2007, 2008, 2009])
list_of_series_to_exclude = []
for i in df1["country"].unique():  # loop over unique countries
    for _, row in df1[df1["country"] == i].iterrows():  # loop over the rows of the current country
        series = row["series"]  # keep track of current series
        years = pd.Series(list(row.iloc[2:]))  # year columns, selected by position
        x = dict(years.value_counts(sort=True))  # counts of each unique value in the year columns
        try:
            if x[''] > len(years) - 2:  # fewer than 2 non-blank values
                list_of_series_to_exclude.append(series)
        except KeyError:
            pass  # row doesn't have a blank value
final_set = set(df1["series"])
set_to_sub = set(list_of_series_to_exclude)
final_list = list(final_set - set_to_sub)
df1 = df1[df1["series"].isin(final_list)]
Output:
print(df1)
  country  series 2005 2006 2007 2008 2009
0     AFG   net m              5        10
1     AFG  battle  100   50   55   60  100
2     AFG     GDP  200  100  150  200  250
4     AFG    life   60        61   63   64
5     ALB   net m              5        10
6     ALB  battle  100   50   55   60  100
7     ALB     GDP  200  100  150  200  250
9     ALB    life   78        61   63   64
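The same filtering can also be written without an explicit loop. The following is a minimal sketch, assuming, as in this answer, that missing values are stored as empty strings; the names year_cols, valid_counts, bad_series and result are introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["AFG", "net m", "", "", 5, "", 10],
     ["AFG", "battle", 100, 50, 55, 60, 100],
     ["AFG", "GDP", 200, 100, 150, 200, 250],
     ["AFG", "info", "", "", "", "", ""],
     ["AFG", "life", 60, "", 61, 63, 64],
     ["ALB", "net m", "", "", 5, "", 10],
     ["ALB", "battle", 100, 50, 55, 60, 100],
     ["ALB", "GDP", 200, 100, 150, 200, 250],
     ["ALB", "info", "", 45, "", 99, ""],
     ["ALB", "life", 78, "", 61, 63, 64]],
    columns=["country", "series", 2005, 2006, 2007, 2008, 2009])

# Assumption: these are exactly the year columns of the frame above.
year_cols = [2005, 2006, 2007, 2008, 2009]

# Count the non-blank year values in each row.
valid_counts = (df1[year_cols] != "").sum(axis=1)

# Series names that have fewer than two values in at least one country.
bad_series = df1.loc[valid_counts < 2, "series"].unique()

# Drop those series for every country, not only the one that triggered it.
result = df1[~df1["series"].isin(bad_series)]
print(result)
```

With the sample data only info falls below two values (for AFG), so it is dropped for both countries.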
Answer 1 (score: 1)
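I am assuming the empty fields are represented as NaN. You can first use isnull and sum to extract the rows that have fewer than two valid values, and then use isin with the associated "series" values to filter the original DataFrame: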
mask = (~df[list(range(2005, 2010))].isnull()).sum(axis=1) < 2  # True for rows with fewer than two valid values
print(df[~df.series.isin(df[mask].series)])
Output:
country series 2005 2006 2007 2008 2009
0 AFG net m NaN NaN 5 NaN 10
1 AFG battle 100 50 55 60 100
2 AFG GDP 200 100 150 200 250
4 AFG life 60 NaN 61 63 64
5 ALB net m NaN NaN 5 NaN 10
6 ALB battle 100 50 55 60 100
7 ALB GDP 200 100 150 200 250
9 ALB life 78 NaN 61 63 64
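For completeness, here is a self-contained sketch of this approach. It assumes the data is loaded from text in which ".." marks a missing value, so na_values=".." is passed to pd.read_csv to obtain NaN. The inline CSV sample and the names raw, year_cols and filtered are illustrative and not taken from the asker's actual file. Note that columns read this way are labelled with strings such as "2005" rather than integers.

```python
import io

import pandas as pd

# Illustrative sample mirroring the table in the question.
raw = """country,series,2005,2006,2007,2008,2009
AFG,net m,..,..,5,..,10
AFG,battle,100,50,55,60,100
AFG,GDP,200,100,150,200,250
AFG,info,..,..,..,..,..
AFG,life,60,..,61,63,64
AFG,unemp,5.7,5.9,6.0,5.4,5.3
ALB,net m,..,..,5,..,10
ALB,battle,100,50,55,60,100
ALB,GDP,200,100,150,200,250
ALB,info,..,45,..,99,..
ALB,life,78,..,61,63,64
ALB,unemp,..,..,..,..,3.2
"""

# Treat ".." as missing so the year columns contain NaN.
df = pd.read_csv(io.StringIO(raw), na_values="..")

# Column labels are strings when read from CSV.
year_cols = [str(year) for year in range(2005, 2010)]

# True for rows with fewer than two non-missing year values.
mask = df[year_cols].notnull().sum(axis=1) < 2

# Remove every series that appears in at least one such row.
filtered = df[~df["series"].isin(df.loc[mask, "series"])]
print(filtered)
```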
Answer 2 (score: 1)
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    '2005': ['..', '100', '200', '..', '60', '5.7', '..', '100', '200', '..', '78', '..'],
    '2006': ['..', '50', '100', '..', '..', '5.9', '..', '50', '100', '45', '..', '..'],
    '2007': ['5', '55', '150', '..', '61', '6.0', '5', '55', '150', '..', '61', '..'],
    '2008': ['..', '60', '200', '..', '63', '5.4', '..', '60', '200', '99', '63', '..'],
    '2009': ['10', '100', '250', '..', '64', '5.3', '10', '100', '250', '..', '64', '3.2'],
    'country': ['AFG', 'AFG', 'AFG', 'AFG', 'AFG', 'AFG', 'ALB', 'ALB', 'ALB', 'ALB', 'ALB', 'ALB'],
    'series': ['net m', 'battle', 'GDP', 'info', 'life', 'unemp', 'net m', 'battle', 'GDP', 'info', 'life', 'unemp']},
    columns=['country', 'series', '2005', '2006', '2007', '2008', '2009']).replace('..', np.nan)
I chose to create a dummy column named Count so that the rows to delete can be seen at a glance, based on whether Count is less than 2.
df1['Count'] = df1.loc[:, '2005':].count(axis=1)
country series 2005 2006 2007 2008 2009 Count
0 AFG net m NaN NaN 5 NaN 10 2
1 AFG battle 100 50 55 60 100 5
2 AFG GDP 200 100 150 200 250 5
3 AFG info NaN NaN NaN NaN NaN 0
4 AFG life 60 NaN 61 63 64 4
5 AFG unemp 5.7 5.9 6.0 5.4 5.3 5
6 ALB net m NaN NaN 5 NaN 10 2
7 ALB battle 100 50 55 60 100 5
8 ALB GDP 200 100 150 200 250 5
9 ALB info NaN 45 NaN 99 NaN 2
10 ALB life 78 NaN 61 63 64 4
11 ALB unemp NaN NaN NaN NaN 3.2 1
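Next, check whether each series value is in the list of values associated with rows whose Count is less than 2. The ~ then excludes those matches from the result.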
df1[~df1['series'].isin(df1[df1['Count'] < 2]['series'].tolist())]
#Produces:
country series 2005 2006 2007 2008 2009 Count
0 AFG net m NaN NaN 5 NaN 10 2
1 AFG battle 100 50 55 60 100 5
2 AFG GDP 200 100 150 200 250 5
4 AFG life 60 NaN 61 63 64 4
6 ALB net m NaN NaN 5 NaN 10 2
7 ALB battle 100 50 55 60 100 5
8 ALB GDP 200 100 150 200 250 5
10 ALB life 78 NaN 61 63 64 4
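If the helper column is not wanted in the result, the same idea can be wrapped in a small function. This is only a sketch: drop_sparse_series, its parameters and the demo frame are hypothetical names introduced here, and it swaps the isin filter for a groupby(...).transform("min"), which gives each row the smallest value count seen for its series in any country.

```python
import numpy as np
import pandas as pd


def drop_sparse_series(df, year_cols, min_values=2, key="series"):
    """Drop every `key` group containing a row with fewer than
    `min_values` non-NaN entries in `year_cols` (hypothetical helper)."""
    # Non-NaN values per row, computed without adding a column to df.
    counts = df[year_cols].count(axis=1)
    # Smallest count for each series across all countries, aligned to the rows.
    worst = counts.groupby(df[key]).transform("min")
    return df[worst >= min_values]


# Small illustrative frame: unemp has only one value for ALB.
demo = pd.DataFrame({
    "country": ["AFG", "AFG", "ALB", "ALB"],
    "series": ["life", "unemp", "life", "unemp"],
    "2008": [63, 5.4, 63, np.nan],
    "2009": [64, 5.3, 64, 3.2],
})

kept = drop_sparse_series(demo, ["2008", "2009"])
print(kept)
```

Because unemp has a single value for ALB, both unemp rows are dropped and only the life rows remain.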