Python: dynamic subsetting in a loop

Time: 2018-08-12 03:17:35

Tags: python pandas dataframe conditional

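I have the following dataframe. One column holds several county names, and the rest of the table holds dates and values. The prerecession max value is the maximum value for a specific county within a specific time frame (not every county experienced the same decline in values right away). For each row, I need to find the time between the date of that row's minimum and the date the value rebounded, that is, the first column after the minimum whose value is equal to or greater than the prerecession max value.

I am new to Python and to Stack Overflow, and I have spent a week researching online without success.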

Dataframe

Final result

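The following code works and evaluates all values in df that are greater than 51000. The question is: how do I subset df dynamically, so that each row is compared with its own threshold? Thanks.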

df
revcols = df.columns.values.tolist()
revcols.reverse()
tmpdf = df > 51000
final=tmpdf[tmpdf.any(axis=1)].idxmax(axis=1)
final

2 answers:

Answer 0 (score: 1)

Use:

df = df.set_index(['County','Prerecession Max Value'])

a = df.idxmin(axis=1)
m1 = df.eq(df.min(axis=1), axis=0).cumsum(axis=1).gt(0)
m2 = df.sub(df.index.get_level_values(1), axis=0).ge(0)
b = (m1 & m2).idxmax(axis=1)

d = {'Date of Min': a, 'Date of Max':b}
df = df.assign(**d).reset_index()
print (df)
     County  Prerecession Max Value   2007   2008   2009   2010   2011   2012  \
0  County 1                  100000  90000  81000  72900  65610  70000  80000   
1  County 2                   20000  18000  16000  21000  22000  23000  24000   
2  County 3                   10000   9000   8100   7290   6561   5905   6405   
3  County 4                    6000   6000   4860   4374   4474   4574   6001   

    2013    2014    2015 Date of Min Date of Max  
0  90000  100000  110000        2010        2014  
1  25000   26000   27000        2008        2009  
2   6905   12405   13405        2011        2014  
3   7000    7500    7900        2009        2012 
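
The question asks for the time between the minimum and the rebound, which the two new columns do not give directly. A minimal sketch, assuming the year labels are strings such as '2010' (as they are when the header comes from a CSV file); the column name `Years to Rebound` is hypothetical and not part of the original answer:

```python
import pandas as pd

# Result columns copied from the output above; the year labels are strings
res = pd.DataFrame({
    'County': ['County 1', 'County 2', 'County 3', 'County 4'],
    'Date of Min': ['2010', '2008', '2011', '2009'],
    'Date of Max': ['2014', '2009', '2014', '2012'],
})

# 'Years to Rebound' is a hypothetical column name chosen for this sketch
res['Years to Rebound'] = (
    res['Date of Max'].astype(int) - res['Date of Min'].astype(int)
)
print(res)
```

If the columns held real dates rather than year labels, converting them with pd.to_datetime before subtracting would probably be the more natural route.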

Setup (the last value in the 2007 column was changed to 6000, to test that a match occurring before the minimum year is ignored):

import pandas as pd
from io import StringIO  # pd.compat.StringIO was removed in newer pandas versions

temp=u"""
County;Prerecession Max Value;2007;2008;2009;2010;2011;2012;2013;2014;2015
County 1;100,000;90,000;81,000;72,900;65,610;70,000;80,000;90,000;100,000;110,000
County 2;20,000;18,000;16,000;21,000;22,000;23,000;24,000;25,000;26,000;27,000
County 3;10,000;9,000;8,100;7,290;6,561;5,905;6,405;6,905;12,405;13,405
County 4;6,000;6,000;4,860;4,374;4,474;4,574;6,001;7,000;7,500;7,900"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", thousands=',')
print (df)
     County  Prerecession Max Value   2007   2008   2009   2010   2011   2012  \
0  County 1                  100000  90000  81000  72900  65610  70000  80000   
1  County 2                   20000  18000  16000  21000  22000  23000  24000   
2  County 3                   10000   9000   8100   7290   6561   5905   6405   
3  County 4                    6000   6000   4860   4374   4474   4574   6001   

    2013    2014    2015  
0  90000  100000  110000  
1  25000   26000   27000  
2   6905   12405   13405  
3   7000    7500    7900  

Explanation

First, move all the non-date columns into a MultiIndex with DataFrame.set_index:

df = df.set_index(['County','Prerecession Max Value'])
print (df)
                                  2007   2008   2009   2010   2011   2012  \
County   Prerecession Max Value                                             
County 1 100000                  90000  81000  72900  65610  70000  80000   
County 2 20000                   18000  16000  21000  22000  23000  24000   
County 3 10000                    9000   8100   7290   6561   5905   6405   
County 4 6000                     6000   4860   4374   4474   4574   6001   

                                  2013    2014    2015  
County   Prerecession Max Value                         
County 1 100000                  90000  100000  110000  
County 2 20000                   25000   26000   27000  
County 3 10000                    6905   12405   13405  
County 4 6000                     7000    7500    7900  

To get the date of each row's minimum, use DataFrame.idxmin:

print (df.idxmin(axis=1))
County    Prerecession Max Value
County 1  100000                    2010
County 2  20000                     2008
County 3  10000                     2011
County 4  6000                      2009
dtype: object

Next, all values after each row's minimum need to be selected. First compare every value with the row minimum using DataFrame.eq:

print (df.eq(df.min(axis=1), axis=0))

                                  2007   2008   2009   2010   2011   2012  \
County   Prerecession Max Value                                             
County 1 100000                  False  False  False   True  False  False   
County 2 20000                   False   True  False  False  False  False   
County 3 10000                   False  False  False  False   True  False   
County 4 6000                    False  False   True  False  False  False   

                                  2013   2014   2015  
County   Prerecession Max Value                       
County 1 100000                  False  False  False  
County 2 20000                   False  False  False  
County 3 10000                   False  False  False  
County 4 6000                    False  False  False  

Then take the cumulative sum along each row with DataFrame.cumsum:

print (df.eq(df.min(axis=1), axis=0).cumsum(axis=1))
                                 2007  2008  2009  2010  2011  2012  2013  \
County   Prerecession Max Value                                             
County 1 100000                     0     0     0     1     1     1     1   
County 2 20000                      0     1     1     1     1     1     1   
County 3 10000                      0     0     0     0     1     1     1   
County 4 6000                       0     0     1     1     1     1     1   

                                 2014  2015  
County   Prerecession Max Value              
County 1 100000                     1     1  
County 2 20000                      1     1  
County 3 10000                      1     1  
County 4 6000                       1     1  

And compare with 0 using DataFrame.gt, so that every column from the minimum onward is True:

print (df.eq(df.min(axis=1), axis=0).cumsum(axis=1).gt(0))
                                  2007   2008   2009   2010  2011  2012  2013  \
County   Prerecession Max Value                                                 
County 1 100000                  False  False  False   True  True  True  True   
County 2 20000                   False   True   True   True  True  True  True   
County 3 10000                   False  False  False  False  True  True  True   
County 4 6000                    False  False   True   True  True  True  True   

                                 2014  2015  
County   Prerecession Max Value              
County 1 100000                  True  True  
County 2 20000                   True  True  
County 3 10000                   True  True  
County 4 6000                    True  True  
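
As a side note, the same "from the minimum onward" mask can also be built with a cumulative maximum over the boolean frame. This is a sketch on a small made-up frame (rows A and B are hypothetical), not the method used in the answer:

```python
import pandas as pd

# Small made-up frame: row A has its minimum in 2010, row B in 2008
toy = pd.DataFrame(
    {'2007': [90000, 18000], '2008': [81000, 16000],
     '2009': [72900, 21000], '2010': [65610, 22000]},
    index=['A', 'B'],
)

is_min = toy.eq(toy.min(axis=1), axis=0)

# As in the answer: cumulative sum, then > 0
m_cumsum = is_min.cumsum(axis=1).gt(0)

# Alternative: once True is seen in a row, the cumulative max stays True
m_cummax = is_min.cummax(axis=1).astype(bool)

print(m_cumsum)
print(m_cumsum.equals(m_cummax))
```

Both variants start the mask at the first occurrence of the minimum if a row contains the same minimum value more than once.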

Then create a second mask: select the second level of the MultiIndex with Index.get_level_values and subtract it from every row with DataFrame.sub:

print (df.index.get_level_values(1))
Int64Index([100000, 20000, 10000, 6000], 
           dtype='int64', name='Prerecession Max Value')

print (df.sub(df.index.get_level_values(1), axis=0))
                                  2007   2008   2009   2010   2011   2012  \
County   Prerecession Max Value                                             
County 1 100000                 -10000 -19000 -27100 -34390 -30000 -20000   
County 2 20000                   -2000  -4000   1000   2000   3000   4000   
County 3 10000                   -1000  -1900  -2710  -3439  -4095  -3595   
County 4 6000                        0  -1140  -1626  -1526  -1426      1   

                                  2013  2014   2015  
County   Prerecession Max Value                      
County 1 100000                 -10000     0  10000  
County 2 20000                    5000  6000   7000  
County 3 10000                   -3095  2405   3405  
County 4 6000                     1000  1500   1900  

Then compare the result with 0 using DataFrame.ge (>= 0):

print (df.sub(df.index.get_level_values(1), axis=0).ge(0))
                                  2007   2008   2009   2010   2011   2012  \
County   Prerecession Max Value                                             
County 1 100000                  False  False  False  False  False  False   
County 2 20000                   False  False   True   True   True   True   
County 3 10000                   False  False  False  False  False  False   
County 4 6000                    True   False  False  False  False   True   

                                  2013  2014  2015  
County   Prerecession Max Value                     
County 1 100000                  False  True  True  
County 2 20000                    True  True  True  
County 3 10000                   False  True  True  
County 4 6000                     True  True  True 
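
The subtraction followed by `ge(0)` can probably be shortened to a single row-wise comparison against the threshold. A sketch on a reduced, two-county copy of the data, assuming the threshold is first converted to a NumPy array so that it is matched to the rows by position:

```python
import pandas as pd

# Two counties and three years taken from the sample data
vals = pd.DataFrame(
    {'2007': [18000, 6000], '2008': [16000, 4860], '2009': [21000, 4374]},
    index=pd.MultiIndex.from_tuples(
        [('County 2', 20000), ('County 4', 6000)],
        names=['County', 'Prerecession Max Value'],
    ),
)

# Per-row threshold taken from the second index level
threshold = vals.index.get_level_values(1).to_numpy()

m2_sub = vals.sub(threshold, axis=0).ge(0)  # two steps, as in the answer
m2_ge = vals.ge(threshold, axis=0)          # one step, direct comparison

print(m2_ge)
```

Either form gives the same boolean mask; the direct comparison simply skips the intermediate frame of differences.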

Combine the two boolean masks with & (AND), and use DataFrame.idxmax to get the column name of the first True in each row:

print ((m1 & m2))
                                  2007   2008   2009   2010   2011   2012  \
County   Prerecession Max Value                                             
County 1 100000                  False  False  False  False  False  False   
County 2 20000                   False  False   True   True   True   True   
County 3 10000                   False  False  False  False  False  False   
County 4 6000                    False  False  False  False  False   True   

                                  2013  2014  2015  
County   Prerecession Max Value                     
County 1 100000                  False  True  True  
County 2 20000                    True  True  True  
County 3 10000                   False  True  True  
County 4 6000                     True  True  True  

print ((m1 & m2).idxmax(axis=1))
County    Prerecession Max Value
County 1  100000                    2014
County 2  20000                     2009
County 3  10000                     2014
County 4  6000                      2012
dtype: object
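
One caveat worth keeping in mind: for a row with no True at all (a county that never rebounds), idxmax still returns the first column label. A hedged sketch of one way to guard against that, using a made-up mask with hypothetical rows x and y:

```python
import pandas as pd

# Made-up mask: row 'x' rebounds in 2009, row 'y' never rebounds
mask = pd.DataFrame(
    {'2007': [False, False], '2008': [False, False], '2009': [True, False]},
    index=['x', 'y'],
)

# Without a guard, the all-False row 'y' reports the first column
first_true = mask.idxmax(axis=1)

# Keep the label only where the row contains at least one True
rebound = first_true.where(mask.any(axis=1))

print(rebound)
```

In the sample data every county does rebound, so the guard changes nothing there; it only matters for rows that stay below the prerecession maximum.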

Then create the new columns with DataFrame.assign, passing a dictionary of the new columns:

d = {'Date of Min': a, 'Date of Max':b}
df = df.assign(**d)
print (df)
                                  2007   2008   2009   2010   2011   2012  \
County   Prerecession Max Value                                             
County 1 100000                  90000  81000  72900  65610  70000  80000   
County 2 20000                   18000  16000  21000  22000  23000  24000   
County 3 10000                    9000   8100   7290   6561   5905   6405   
County 4 6000                     6000   4860   4374   4474   4574   6001   

                                  2013    2014    2015 Date of Min Date of Max  
County   Prerecession Max Value                                                 
County 1 100000                  90000  100000  110000        2010        2014  
County 2 20000                   25000   26000   27000        2008        2009  
County 3 10000                    6905   12405   13405        2011        2014  
County 4 6000                     7000    7500    7900        2009        2012  

Finally, use reset_index to turn the MultiIndex levels back into columns.

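Answer 1 (score: 0)

Thanks for posting this question. I came up with the following solution:

I created a CSV file from the sample data provided in the question and named it stack.csv. I added three new columns to this CSV to hold the computed values:

  1. MinVal_Year - the year in which the county has its minimum value
  2. Rebound_Year - the year in which the value rebounds to the prerecession value
  3. TimeDiff - the time between the minimum-value year and the rebound year

These columns initially contain null or NaN.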


Now let's look at the solution I wrote:

import pandas as pd

# Load the CSV file into a data frame
df = pd.read_csv('stack.csv')

# Columns that hold the yearly values
years = ['2007','2008','2009','2010','2011','2012','2013','2014','2015']

# Year in which each county has its minimum value
df['MinVal_Year'] = df[years].idxmin(axis=1).astype(int)

# Iterate over the main data frame county by county to find the rebound year:
# the first year after the minimum whose value is >= the prerecession value.
# (df.set_value was removed from pandas; df.at is used instead.)
for j in df.index:
    for year in years:
        after_min = int(year) > df.at[j, 'MinVal_Year']
        if after_min and df.at[j, year] >= df.at[j, 'prerecession val']:
            df.at[j, 'Rebound_Year'] = int(year)
            break

# Number of years elapsed between the year of the minimum value and the rebound year
df['TimeDiff'] = df['Rebound_Year'] - df['MinVal_Year']

Let's look at the key columns of the resulting data frame:

df[['county','prerecession val','MinVal_Year','Rebound_Year','TimeDiff']]


I hope this end-to-end tested solution helps.