I am trying to rewrite this MATLAB/Octave repo in Python. I came across what appears to be an implementation of an entropy function (see below). After some research I found that I could use scipy's entropy implementation for the Python version. But after reading a bit more about scipy's entropy formula (e.g. S = -sum(pk * log(pk), axis=0)), I doubt that the two are computing the same thing...
Can someone confirm my thinking?
% author: YangSong, 2010.11.16, C230
% file: ys_sampEntropy.m
% code is called from line 101 of algotrading.m
% => entropy180(i)=ys_sampEntropy(kmeans180s1(i,1:180));
% where kmeans180s1 is an array of size 100x181 containing the kmeans
% centroids and the price label at position 181.
function sampEntropy=ys_sampEntropy(xdata)
m=2;
n=length(xdata);
r=0.2*std(xdata); % tolerance threshold for template matching
%r=0.05;
cr=[];
gn=1;
gnmax=m;
while gn<=gnmax
    d=zeros(n-m+1,n-m);  % matrix holding the pairwise template distances
    x2m=zeros(n-m+1,m);  % matrix holding the length-m template vectors
    cr1=zeros(1,n-m+1);  % vector holding the per-template match counts
    k=1;
    for i=1:n-m+1
        for j=1:m
            x2m(i,j)=xdata(i+j-1);
        end
    end
    x2m;
    for i=1:n-m+1
        for j=1:n-m+1
            if i~=j
                d(i,k)=max(abs(x2m(i,:)-x2m(j,:))); % Chebyshev distance between templates i and j
                k=k+1;
            end
        end
        k=1;
    end
    d;
    for i=1:n-m+1
        [k,l]=size(find(d(i,:)<r)); % l = number of distances smaller than r
        cr1(1,i)=l;
    end
    cr1;
    cr1=(1/(n-m))*cr1;
    sum1=0;
    for i=1:n-m+1
        if cr1(i)~=0
            %sum1=sum1+log(cr1(i));
            sum1=sum1+cr1(i);
        end % end if
    end % end for
    cr1=1/(n-m+1)*sum1;
    cr(1,gn)=cr1;
    gn=gn+1;
    m=m+1;
end % end while
cr;
sampEntropy=log(cr(1,1))-log(cr(1,2));
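For reference, here is a minimal example of what I understand scipy's entropy to compute (on a discrete probability vector):

import numpy as np
from scipy.stats import entropy

pk = np.array([0.1, 0.2, 0.3, 0.4])  # a discrete probability distribution
print(entropy(pk))            # -sum(pk * log(pk)), in nats
print(entropy(pk, base=2))    # the same quantity in bits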
Answer 0 (score: 0)
The code is pretty hard to follow, but it is clearly not an implementation of the Shannon entropy of a discrete variable as implemented in scipy. Rather, it vaguely resembles the Kozachenko-Leonenko k-nearest-neighbour estimator of the entropy of a continuous variable (Kozachenko & Leonenko, 1987).
The basic idea of that estimator is to look at the average distance between neighbouring data points. The intuition is that if that distance is large, the dispersion in your data is large and hence the entropy is large. In practice, instead of the nearest-neighbour distance one tends to use the k-nearest-neighbour distance, which makes the estimate more robust.
The code does show some distance computations,
d(i,k)=max(abs(x2m(i,:)-x2m(j,:)));
and some counting of points that are closer than some fixed distance:
[k,l]=size(find(d(i,:)<r));
However, it is pretty clear that this is not exactly the Kozachenko-Leonenko estimator, but rather some butchered version of it.
If you do end up wanting to compute the Kozachenko-Leonenko estimator, I have some code implementing it on my github:
https://github.com/paulbrodersen/entropy_estimators
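For what it's worth, here is a rough 1-D sketch of the k-nearest-neighbour idea (my own illustration with an assumed helper name, not the API of that repository):

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_entropy_1d(x, k=3):
    # Rough Kozachenko-Leonenko estimate of the differential entropy (in nats) of 1-D data.
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    n = len(x)
    tree = cKDTree(x)
    # query k+1 neighbours: the closest "neighbour" of each point is the point itself
    eps = tree.query(x, k=k + 1)[0][:, -1]
    eps = np.maximum(eps, 1e-12)  # guard against log(0) when there are duplicate points
    return digamma(n) - digamma(k) + np.mean(np.log(2.0 * eps))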
Having seen this mess, though, I am no longer sure the author isn't actually using (or trying to use?) the classic Shannon definition of information for discrete variables, even though the inputs are continuous:
for i=1:n-m+1
    [k,l]=size(find(d(i,:)<r)); % l = number of distances smaller than r
    cr1(1,i)=l;
end
cr1;
cr1=(1/(n-m))*cr1;
The for loop counts the number of data points that are closer than r, and the last line of that snippet then divides the count by some interval to obtain a density.
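In NumPy terms that step would look roughly like this (a hypothetical sketch, with d, r, n and m as in the MATLAB code above):

counts = (d < r).sum(axis=1)   # number of templates within tolerance r of template i
cr1 = counts / (n - m)         # divide by the number of comparisons to get a "density"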
These densities are then summed up as follows:
for i=1:n-m+1
    if cr1(i)~=0
        %sum1=sum1+log(cr1(i));
        sum1=sum1+cr1(i);
    end % end if
end % end for
Then we get to these bits (again!):
cr1=1/(n-m+1)*sum1;
cr(1,gn)=cr1;
and
sampEntropy=log(cr(1,1))-log(cr(1,2));
My brain refuses to believe that the value returned could be anything like your average log(p), but I am no longer 100% sure.
Either way, if you want to compute the entropy of a continuous variable, you should either fit a distribution to your data or use the Kozachenko-Leonenko estimator. And please write better code.
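For the first option, a minimal sketch with scipy (assuming, purely for illustration, that a normal distribution is a reasonable fit to the data):

import numpy as np
from scipy import stats

data = np.random.normal(loc=0.0, scale=1.5, size=1000)  # stand-in for your data
loc, scale = stats.norm.fit(data)                        # fit a distribution...
h = stats.norm(loc, scale).entropy()                     # ...and take its differential entropy (nats)
print(h)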
Answer 1 (score: 0)
import numpy as np

# Entropy
def entropy(Y):
    """
    Also known as Shannon entropy.
    Reference: https://en.wikipedia.org/wiki/Entropy_(information_theory)
    """
    unique, count = np.unique(Y, return_counts=True, axis=0)
    prob = count / len(Y)
    en = np.sum((-1) * prob * np.log2(prob))
    return en

# Joint entropy
def jEntropy(Y, X):
    """
    H(Y, X)
    Reference: https://en.wikipedia.org/wiki/Joint_entropy
    """
    YX = np.c_[Y, X]
    return entropy(YX)

# Conditional entropy
def cEntropy(Y, X):
    """
    Conditional entropy = joint entropy - entropy of X
    H(Y|X) = H(Y, X) - H(X)
    Reference: https://en.wikipedia.org/wiki/Conditional_entropy
    """
    return jEntropy(Y, X) - entropy(X)

# Information gain
def gain(Y, X):
    """
    Information gain, I(Y; X) = H(Y) - H(Y|X)
    Reference: https://en.wikipedia.org/wiki/Information_gain_in_decision_trees#Formal_definition
    """
    return entropy(Y) - cEntropy(Y, X)
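A quick usage example with two small discrete variables (hypothetical data, just to show the calling pattern):

Y = np.array([0, 0, 1, 1, 1, 0])
X = np.array([0, 1, 1, 1, 0, 0])
print(entropy(Y))       # H(Y)
print(jEntropy(Y, X))   # H(Y, X)
print(cEntropy(Y, X))   # H(Y | X)
print(gain(Y, X))       # I(Y; X) = H(Y) - H(Y | X)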
Answer 2 (score: 0)
This is what I use:
import numpy as np

def entropy(data, bins=None):
    if bins is None:
        bins = len(np.unique(data))
    cx = np.histogram(data, bins)[0]
    normalized = cx / float(np.sum(cx))
    normalized = normalized[np.nonzero(normalized)]
    h = -np.sum(normalized * np.log2(normalized))
    return h


def approx_entropy(U, m, r):
    """
    Approximate entropy (ApEn): used to quantify the amount of regularity and the
    unpredictability of fluctuations in time-series data.

    The presence of repetitive patterns of fluctuation in a time series renders it
    more predictable than a time series in which such patterns are absent.
    ApEn reflects the likelihood that similar patterns of observations will not be
    followed by additional similar observations. A time series containing many
    repetitive patterns has a relatively small ApEn; a less predictable process has
    a higher ApEn.

    U: time series
    m: (window) length of compared runs of data
    r: filtering level (tolerance)

    Good values: m = 2 or 3; r = 10%-25% of the standard deviation of the series.
    https://en.wikipedia.org/wiki/Approximate_entropy
    """
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0) ** (-1) * sum(np.log(C))

    N = len(U)
    return abs(_phi(m + 1) - _phi(m))
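A quick sanity check (hypothetical values in the spirit of the Wikipedia example): a highly regular series should give an ApEn close to zero, while shuffling the same values should increase it.

U = np.array([85, 80, 89] * 17)
print(approx_entropy(U, 2, 3))                   # very regular series -> small ApEn

rng = np.random.default_rng(0)
print(approx_entropy(rng.permutation(U), 2, 3))  # shuffled values -> typically a larger ApEn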