I am trying to implement kernel density estimation, but my code does not produce the answer it should. It is written in Julia, but the code should be self-explanatory.
Here is the algorithm:

f_h(x) = (1 / (n * h)) * Σ_{i=1}^{n} K((x - X_i) / h),   where   K(u) = 0.5 if |u| <= 1 and 0 otherwise (the uniform kernel).
So the algorithm tests whether the distance between x and an observation X_i, weighted by some constant factor (the binwidth h), is less than 1. If it is, it assigns 0.5 / (n * h) to that point, where n = number of observations.
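Just to make sure I am reading the formula correctly, here is a tiny hand check (a throwaway sketch; the numbers and the binwidth h = 0.1 are made up):

x, X_i, h = 0.45, 0.50, 0.1      # one evaluation point, one observation, my chosen binwidth
u = (x - X_i) / h                # u ≈ -0.5
k = abs(u) <= 1 ? 0.5 : 0.0      # kernel value is 0.5 because |u| <= 1
# this observation would then contribute k / (n * h) to the estimate at x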
Here is my implementation:
#Kernel density function.
#Purpose: estimate the probability density function (pdf)
#of given observations
#@param data: observations for which the pdf should be estimated
#@return: returns an array with the estimated densities
function kernelDensity(data)

    #Uniform kernel function.
    #@param x: current x value
    #@param observation: x value of observation i
    #@param width: binwidth
    #@return: returns 1 if the absolute distance from
    #x (current) to x (observation), weighted by the binwidth,
    #is less than 1; else it returns 0.
    function uniformKernel(x, observation, width)
        u = (x - observation) / width
        abs(u) <= 1 ? 1 : 0
    end

    #number of observations in the data set
    n = length(data)

    #binwidth (set arbitrarily to 0.1)
    h = 0.1

    #vector that stores the pdf
    res = zeros(Real, n)

    #counter variable for the loop
    counter = 0

    #lower and upper limit of the x axis
    start = floor(minimum(data))
    stop  = ceil(maximum(data))

    #main loop
    #@linspace: divides the space from start to stop into n
    #equally spaced intervals
    for x in linspace(start, stop, n)
        counter += 1
        for observation in data
            #count all observations for which the kernel
            #returns 1, and multiply by 0.5 because the
            #kernel uses the absolute difference, which can be
            #either positive or negative
            res[counter] += 0.5 * uniformKernel(x, observation, h)
        end
        #divide by n times h
        res[counter] /= n * h
    end
    #return results
    res
end
#run function
#@rand: generates 10 uniform random numbers between 0 and 1
kernelDensity(rand(10))
And it returns:
> 0.0
> 1.5
> 2.5
> 1.0
> 1.5
> 1.0
> 0.0
> 0.5
> 0.5
> 0.0
which sums to 8.5 (the cumulative distribution function; it should be 1).
So there are two errors: the values are not scaled correctly, and the total grows with the number of observations. For example:
> kernelDensity(rand(1000))
> 953.53
I believe I implemented the formula one-to-one, so I really do not understand where the error is.
Answer 0 (score: 5)
I am not an expert on KDE, so take all of this with a grain of salt, but a very similar (and much faster!) implementation of your code would be:
function kernelDensity{T<:AbstractFloat}(data::Vector{T}, h::T)
    n = length(data)
    res = zeros(T, n)
    lb = minimum(data); ub = maximum(data)
    for (i, x) in enumerate(linspace(lb, ub, n))
        for obs in data
            res[i] += abs((obs - x) / h) <= 1.0 ? 0.5 : 0.0
        end
        res[i] /= (n * h)
    end
    sum(res)
end
If I am not mistaken, the density estimate should sum to 1, in the sense that we would expect kernelDensity(rand(100), 0.1)/100
to get at least close to 1. With the implementation above I get there, give or take 5%, but then again we do not know that 0.1 is the optimal bandwidth (using h=0.135
instead I get within 0.1%), and the uniform kernel is known to be only about 93% "efficient".
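If you want to check the normalization more directly, here is a rough sketch on top of the function above (the variable names and the Riemann-sum idea are mine; h = 0.135 is just the value that happened to work well for me):

data = rand(100)
h    = 0.135
lb, ub = minimum(data), maximum(data)
n  = length(data)
dx = (ub - lb) / (n - 1)              # spacing of the evaluation grid used inside kernelDensity
total = kernelDensity(data, h) * dx   # sum of densities times grid spacing ≈ ∫ f(x) dx
# total should land reasonably close to 1 (some mass is lost beyond [lb, ub])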
In any case, there is a very good kernel density package in Julia here, so you should probably just do Pkg.add("KernelDensity")
instead of trying to code your own Epanechnikov kernel :)
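For what it is worth, using the package could look roughly like this (I am going from memory of its README, so treat the keyword and field names as a sketch rather than a guarantee):

using KernelDensity

data = rand(1000)
k = kde(data)                      # bandwidth is picked automatically; kde(data, bandwidth=0.1) to force one
# k.x is the evaluation grid and k.density the estimated density values
step(k.x) * sum(k.density)         # Riemann sum of the estimate, should be close to 1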
Answer 1 (score: 3)