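Using groupby with a function

Time: 2015-04-07 20:52:42

Tags: python pandas statistics

I have code that computes the slope between an x and a y variable (the Theil-Sen slope), and I want to run it on a list of values, group by group. My file looks like this: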

station_id  year  Sum
210018      1917  329.946
210018      1918  442.214
210018      1919  562.864
210018      1920  396.748
210018      1921  604.266
210019      1917  400.946
210019      1918  442.214
210019      1919  600.864
210019      1920  250.748
210019      1921  100.266

My output should be:

210018: -117189.92, 61.29
210019: 164382, -85.45

The code I am using is:

def theil_sen(x,y):
    n   = len(x)
    ord = numpy.argsort(x)
    xs  = x[ord]
    ys  = y[ord]
    vec1 = numpy.zeros( (n,n) )
    for ii in range(n):
        for jj in range(n):
            vec1[ii,jj] = ys[ii]-ys[jj]
    vec2 = numpy.zeros( (n,n) )
    for ii in range(n):
        for jj in range(n):
            vec2[ii,jj] = xs[ii]-xs[jj]
    v1    = vec1[vec2>0]    
    v2    = vec2[vec2>0]     
    slope = numpy.median( v1/v2 )
    coef  = numpy.zeros( (2,1) ) 
    b_0   = numpy.median(y)-slope*numpy.median(x)
    b_1   = slope
    res   = y-b_1*x-b_0 # residuals 

    return (b_0,b_1,res)

stat=df.groupby(['station_id']).apply(lambda x: theil_sen(x['year'], x['Sum']))

print stat

So year is my x variable and Sum is my y variable. The code runs correctly for station 210018, but for 210019 it returns nan. Any help would be appreciated.

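1 answer:

Answer 0: (score: 0)

numpy.argsort(x) does not play well with a pandas Series. After the first group it stops working as expected, because the group's index no longer runs from 0 to n-1. Work on x and y as NumPy arrays instead.

This works.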

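import numpy

def theil_sen(x,y):
    # Convert the pandas Series to NumPy arrays so that the indexing
    # below is positional rather than label-based.
    x = x.values
    y = y.values
    n   = len(x)
    ord = numpy.argsort(x)    # positions that sort x
    xs  = x[ord]
    ys  = y[ord]
    # Pairwise differences of y over all (ii, jj) pairs.
    vec1 = numpy.zeros( (n,n) )
    for ii in range(n):
        for jj in range(n):
            vec1[ii,jj] = ys[ii]-ys[jj]
    # Pairwise differences of x over the same pairs.
    vec2 = numpy.zeros( (n,n) )
    for ii in range(n):
        for jj in range(n):
            vec2[ii,jj] = xs[ii]-xs[jj]
    # Keep only the pairs with a positive x difference.
    v1    = vec1[vec2>0]
    v2    = vec2[vec2>0]
    slope = numpy.median( v1/v2 )    # median of the pairwise slopes
    coef  = numpy.zeros( (2,1) )     # unused
    b_0   = numpy.median(y)-slope*numpy.median(x)    # intercept
    b_1   = slope
    res   = y-b_1*x-b_0 # residuals

    return (b_0,b_1,res)

# df is the DataFrame holding the data from the question.
stat=df.groupby(['station_id']).apply(lambda x: theil_sen(x['year'], x['Sum']))

print(stat)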


station_id
210018        (-117189.927333, 61.2986666667, [10.3293333333...
210019        (164382.3745, -85.4515, [-170.903, -44.1835, 1...
dtype: object
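The result above is a Series of tuples, whereas the question asks for one intercept and one slope per station. A minimal sketch of one way to get that shape, assuming Python 3 and a reasonably recent pandas: the helper name theil_sen_coefs is hypothetical, and it swaps the nested loops for NumPy broadcasting, which computes the same pairwise differences. Sorting is left out because every pair with a positive x difference is considered regardless of order.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "station_id": [210018] * 5 + [210019] * 5,
    "year": [1917, 1918, 1919, 1920, 1921] * 2,
    "Sum": [329.946, 442.214, 562.864, 396.748, 604.266,
            400.946, 442.214, 600.864, 250.748, 100.266],
})

def theil_sen_coefs(group):
    """Hypothetical helper: intercept and slope for one station's rows."""
    x = group["year"].to_numpy(dtype=float)
    y = group["Sum"].to_numpy(dtype=float)
    dx = x[:, None] - x[None, :]   # dx[i, j] = x[i] - x[j]
    dy = y[:, None] - y[None, :]   # dy[i, j] = y[i] - y[j]
    mask = dx > 0                  # each pair once, skipping equal x values
    slope = np.median(dy[mask] / dx[mask])
    intercept = np.median(y) - slope * np.median(x)
    return pd.Series({"intercept": intercept, "slope": slope})

# Returning a Series from the helper yields one row per station.
coefs = df.groupby("station_id")[["year", "Sum"]].apply(theil_sen_coefs)

for station_id, row in coefs.iterrows():
    print(f"{station_id}: {row['intercept']:.2f}, {row['slope']:.2f}")
```

The residuals are dropped here only to keep the table flat; they could be returned separately if needed.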

The only addition to the existing function is these two lines.

x = x.values
y = y.values
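An alternative, sketched here as an assumption rather than as part of the original answer, is to leave the function body alone and convert at the call site. The function below is the question's version with the loops merged and the unused coef line left out; to_numpy() is used in place of .values and assumes a pandas version that provides it.

```python
import numpy
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "station_id": [210018] * 5 + [210019] * 5,
    "year": [1917, 1918, 1919, 1920, 1921] * 2,
    "Sum": [329.946, 442.214, 562.864, 396.748, 604.266,
            400.946, 442.214, 600.864, 250.748, 100.266],
})

def theil_sen(x, y):
    # Expects NumPy arrays; no conversion happens inside the function.
    n = len(x)
    ord = numpy.argsort(x)
    xs = x[ord]
    ys = y[ord]
    vec1 = numpy.zeros((n, n))
    vec2 = numpy.zeros((n, n))
    for ii in range(n):
        for jj in range(n):
            vec1[ii, jj] = ys[ii] - ys[jj]
            vec2[ii, jj] = xs[ii] - xs[jj]
    v1 = vec1[vec2 > 0]
    v2 = vec2[vec2 > 0]
    slope = numpy.median(v1 / v2)
    b_0 = numpy.median(y) - slope * numpy.median(x)
    b_1 = slope
    res = y - b_1 * x - b_0  # residuals
    return (b_0, b_1, res)

# Convert each group's columns to arrays before they reach theil_sen.
stat = df.groupby("station_id")[["year", "Sum"]].apply(
    lambda g: theil_sen(g["year"].to_numpy(), g["Sum"].to_numpy())
)
print(stat)
```

Converting at the call site keeps theil_sen usable with plain arrays elsewhere, at the cost of having to remember the conversion in every caller.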

Now let's see what goes wrong when np.argsort() is applied to a Series object for any group after the first. Take the values of the second group. Here they are:

In [70]: x
Out[70]:
5    1917
6    1918
7    1919
8    1920
9    1921
Name: year, dtype: int64

In [71]: numpy.argsort(x)
Out[71]:
5    0
6    1
7    2
8    3
9    4
Name: year, dtype: int64

In [72]: x[numpy.argsort(x)]
Out[72]:
year
0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
Name: year, dtype: float64

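Since ord always holds positions in the range 0 to n-1, while x[ord] on a Series looks those numbers up as index labels, x[ord] returns NaN values for every group after the first.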
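The difference between label lookup and positional lookup can be sketched in isolation. This is an illustration under assumptions: the years are shuffled so that sorting has something to do, and reindex stands in for the label lookup that x[ord] performed in the session above, because newer pandas versions may raise KeyError for missing labels instead of returning NaN.

```python
import numpy as np
import pandas as pd

# Years of the second group with its index labels 5..9, shuffled here
# (an assumption) so that the sort order is not trivial.
x = pd.Series([1919, 1917, 1921, 1918, 1920],
              index=[5, 6, 7, 8, 9], name="year")

order = np.argsort(x.to_numpy())   # positions in 0..4, as a plain array

by_label = x.reindex(order)        # label lookup: labels 0..4 do not exist
by_position = x.iloc[order]        # positional lookup on the Series
on_array = x.to_numpy()[order]     # positional lookup on the NumPy array

print(by_label.isna().all())       # every label lookup missed
print(by_position.tolist())
print(on_array.tolist())
```

Both positional variants return the years in sorted order, which is why converting to NumPy arrays (or using iloc) fixes the later groups.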