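I have code that computes the slope between x and y variables (the Theil-Sen slope), and I want to run it on a list of values, group by group. My file looks like this: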
station_id year Sum
210018 1917 329.946
210018 1918 442.214
210018 1919 562.864
210018 1920 396.748
210018 1921 604.266
210019 1917 400.946
210019 1918 442.214
210019 1919 600.864
210019 1920 250.748
210019 1921 100.266
My output should be:
210018: -117189.92, 61.29
210019: 164382, -85.45
The code I am using is:
import numpy

def theil_sen(x,y):
    n = len(x)
    ord = numpy.argsort(x)
    xs = x[ord]
    ys = y[ord]
    # pairwise differences in y
    vec1 = numpy.zeros( (n,n) )
    for ii in range(n):
        for jj in range(n):
            vec1[ii,jj] = ys[ii]-ys[jj]
    # pairwise differences in x
    vec2 = numpy.zeros( (n,n) )
    for ii in range(n):
        for jj in range(n):
            vec2[ii,jj] = xs[ii]-xs[jj]
    v1 = vec1[vec2>0]
    v2 = vec2[vec2>0]
    slope = numpy.median( v1/v2 )
    coef = numpy.zeros( (2,1) )
    b_0 = numpy.median(y)-slope*numpy.median(x)
    b_1 = slope
    res = y-b_1*x-b_0 # residuals
    return (b_0,b_1,res)

stat=df.groupby(['station_id']).apply(lambda x: theil_sen(x['year'], x['Sum']))
print(stat)
So year is my x variable and Sum is my y variable. The code runs correctly for station 210018, but for 210019 it returns nan. Any help would be appreciated.
Answer 0 (score: 0)
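numpy.argsort(x) misbehaves when it is given a pandas Series. After the first group it no longer works as expected, because the group's index does not run from 0 to n-1. Work on the NumPy arrays behind x and y instead.
This works: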
import numpy

def theil_sen(x,y):
    # work on plain NumPy arrays so that indexing is positional
    x = x.values
    y = y.values
    n = len(x)
    ord = numpy.argsort(x)
    xs = x[ord]
    ys = y[ord]
    # pairwise differences in y
    vec1 = numpy.zeros( (n,n) )
    for ii in range(n):
        for jj in range(n):
            vec1[ii,jj] = ys[ii]-ys[jj]
    # pairwise differences in x
    vec2 = numpy.zeros( (n,n) )
    for ii in range(n):
        for jj in range(n):
            vec2[ii,jj] = xs[ii]-xs[jj]
    v1 = vec1[vec2>0]
    v2 = vec2[vec2>0]
    slope = numpy.median( v1/v2 )
    coef = numpy.zeros( (2,1) )
    b_0 = numpy.median(y)-slope*numpy.median(x)
    b_1 = slope
    res = y-b_1*x-b_0 # residuals
    return (b_0,b_1,res)

stat=df.groupby(['station_id']).apply(lambda x: theil_sen(x['year'], x['Sum']))
print(stat)
station_id
210018 (-117189.927333, 61.2986666667, [10.3293333333...
210019 (164382.3745, -85.4515, [-170.903, -44.1835, 1...
dtype: object
The only additions to the existing function are these two lines:
x = x.values
y = y.values
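As a side note, the two nested loops that build the pairwise differences can also be written with NumPy broadcasting. The following is only a sketch under the same definition of slope and intercept as the function above; the name theil_sen_fit is hypothetical, it returns just the intercept and slope (no residuals), and it converts its inputs with np.asarray so that a pandas index cannot interfere.

```python
import numpy as np

def theil_sen_fit(x, y):
    """Return (intercept, slope) from the median of pairwise slopes.

    Hypothetical helper for illustration, not part of the original code.
    """
    # np.asarray drops any pandas index, so everything below is positional.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise differences via broadcasting: dx[i, j] = x[i] - x[j].
    dx = x[:, None] - x[None, :]
    dy = y[:, None] - y[None, :]
    # Keep each pair once (x[i] > x[j]); pairs with equal x are skipped.
    mask = dx > 0
    slope = np.median(dy[mask] / dx[mask])
    intercept = np.median(y) - slope * np.median(x)
    return intercept, slope

years = [1917, 1918, 1919, 1920, 1921]
print(theil_sen_fit(years, [329.946, 442.214, 562.864, 396.748, 604.266]))
print(theil_sen_fit(years, [400.946, 442.214, 600.864, 250.748, 100.266]))
```

For the two stations in the question this should reproduce the intercepts and slopes shown in the output above, up to rounding. No sorting is needed here, because the mask dx > 0 already selects each pair of distinct x values exactly once.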
Now let's see what goes wrong when np.argsort() is applied to a Series object after the first group. Take the second group's values:
In [70]: x
Out[70]:
5 1917
6 1918
7 1919
8 1920
9 1921
Name: year, dtype: int64
In [71]: numpy.argsort(x)
Out[71]:
5 0
6 1
7 2
8 3
9 4
Name: year, dtype: int64
In [72]: x[numpy.argsort(x)]
Out[72]:
year
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: year, dtype: float64
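Since ord always holds the positions 0 to n-1, x[ord] for later groups looks up labels that do not exist in that group's index, so it returns NaN values.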
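The difference between label-based and position-based lookup can be reproduced in isolation. The sketch below rebuilds the second group's year column by hand (the index labels 5 to 9 are copied from the session above) and uses reindex to stand in for the label lookup; this is an assumption made for illustration, since recent pandas versions may raise a KeyError for x[ord] instead of returning NaN.

```python
import numpy as np
import pandas as pd

# Second group from the question: its index labels run from 5 to 9.
x = pd.Series([1917, 1918, 1919, 1920, 1921],
              index=[5, 6, 7, 8, 9], name="year")

# Sorting positions computed on the raw values: always 0..n-1.
order = np.argsort(x.to_numpy())

# Label-based lookup: none of the labels 0..4 exist, so every value is NaN.
by_label = x.reindex(order)

# Position-based lookups: both return the years in sorted order.
by_position = x.iloc[order]
as_array = x.to_numpy()[order]

print(by_label.isna().all())
print(by_position.tolist())
print(as_array.tolist())
```

Either x.iloc[order] or indexing the underlying array avoids the problem; the fix above takes the second route by converting to arrays at the top of the function.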