Is there a way to do feature selection on a data frame in R by computing mutual information or information gain?

Asked: 2019-07-15 13:20:13

Tags: r dplyr feature-extraction

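I would like to compute mutual information (MI) for all rows (i.e. features) of a data frame, and then do feature selection based on the resulting MI values. In my dataset, the rows are the raw feature set and the columns are the different groups. I understand the concept of pointwise mutual information (PMI) from NLP, but I'm not sure how MI works in R. Essentially, I want to do feature selection by computing mutual information. How can I do this in R? Is there an efficient way to achieve it, or is there an R package that performs this kind of feature selection? Any ideas would be appreciated.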

Reproducible data

Here is some reproducible data to work with:

> dput(HTA20_filt_corr[1:20, 1:5])
structure(c(6.06221469449721, 3.79648446367096, 4.44302662142323, 
5.83652223195279, 2.68934375273141, 2.74561888109989, 3.79468365910661, 
2.84818282222582, 2.14058977019523, 2.6928480064245, 2.35292391447048, 
2.48476830655452, 6.53876010917445, 4.65751152599579, 3.04781583130435, 
5.77123333840058, 3.12373340327186, 2.19534644753427, 2.97565909758917, 
3.32457362519432, 5.8755020052495, 3.45024474095539, 4.3934877055859, 
5.89836406552412, 2.55675627493564, 2.70765553292035, 4.29971184424969, 
2.48325694938049, 2.26880029802564, 3.03461160119094, 2.3853610213164, 
2.28880889278209, 7.38935014141236, 5.99396449205588, 2.81020023855867, 
6.15414625452898, 2.71038534186171, 2.23803889487068, 2.83352503485538, 
3.40195667040699, 6.12613148162098, 3.62841140410044, 4.6237834519809, 
6.01979203584278, 2.61341541015611, 2.80774129091983, 3.81085169542991, 
3.2386968734862, 2.3315210232915, 2.75618624035735, 2.36292219228603, 
2.31409329648109, 6.89661896623484, 4.94260091412701, 3.30560274327296, 
5.4547259473827, 2.41056409104863, 2.26899775961818, 2.6699701841279, 
3.01459760807053, 6.1345548976595, 3.51232455992681, 4.66743523288194, 
5.98400432133011, 2.69430042092269, 2.8653583834812, 3.81895258294878, 
2.72080210986981, 2.33064119419619, 2.77388400895015, 2.46939314182722, 
2.28927162732448, 6.93808821971072, 5.63306489420911, 2.75877942216047, 
5.82872398278859, 2.92710196023309, 2.34137181372226, 2.52271243341233, 
2.96285787017003, 6.28953417729806, 3.56819306931016, 4.97483476597509, 
6.1149144301085, 2.73207812554522, 3.00137677271996, 4.03594900960396, 
2.58058159047299, 2.24052626899434, 3.2286586324064, 2.30413560438815, 
2.38147147362554, 6.58149585137493, 4.16189923349488, 2.36086328728537, 
5.57065453220316, 2.57313948725185, 2.36046878474564, 2.54370710157379, 
2.97488700289993), .Dim = c(20L, 5L), .Dimnames = list(c("1_at", 
"10_at", "100009613_at", "100009676_at", "10003_at", "100033411_at", 
"100033414_at", "100033418_at", "100033422_at", "100033423_at", 
"100033424_at", "100033425_at", "100033426_at", "100033431_at", 
"100033432_at", "100033434_at", "100033436_at", "100033437_at", 
"100033438_at", "100033439_at"), c("Tarca_001_P1A01", "Tarca_004_P1A04", 
"Tarca_005_P1A05", "Tarca_007_P1A07", "Tarca_008_P1A08")))

My naive attempt

require(infotheo)
apply(HTA20_filt_corr,1, mutinformation)

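But I don't think this is the right way to compute mutual information and do feature selection based on it. Can anyone point out how to do this? Thanks.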

Desired output

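Basically, in my expected output the features should be reduced/filtered by looking up a table of mutual information values computed from the original data frame. How can this be done in R? Any ideas?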

1 answer:

Answer 0: (score: 2)

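Mutual information is a bit like correlation: you need at least two vectors to compute it. With your data you could compute, for example, the mutual information between 100009613_at and 10003_at, or between every feature and every other feature. But first you need to transform your data: mutual information requires discretized values.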

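library(infotheo) # provides discretize() and mutinformation()

mtx <- data.matrix(HTA20_filt_corr)
mtx <- t(mtx) # features in columns
mtxd <- discretize(mtx, nbins=3) # equal-frequency binning by default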

mutinformation(mtxd[,"100009613_at"], mtxd[,"10003_at"])
# [1] 0.7776613

# or, each against each
eae <- mutinformation(mtxd)

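Take a look at eae. It is a square matrix, with one row and one column per feature. So, how would you want to use it to filter features?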
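One possible way to use such a pairwise MI matrix, offered only as a sketch and not as the answer's own method, is a greedy redundancy filter: scale each MI value by the smaller of the two feature entropies (the diagonal of the matrix), then drop any feature that shares too much information with a feature already kept. The toy data, the feature names `f1` to `f4`, and the `threshold` of 0.5 are all assumptions made for illustration; the same steps would apply to `eae` from above.

```r
library(infotheo)

# Toy stand-in for the transposed expression matrix:
# samples in rows, features in columns (hypothetical data).
set.seed(1)
n  <- 60
f1 <- rnorm(n)
toy <- cbind(f1 = f1,
             f2 = f1 + rnorm(n, sd = 0.01),  # nearly a copy of f1
             f3 = rnorm(n),
             f4 = rnorm(n))

toyd <- discretize(toy, nbins = 3)  # returns a data frame of bin indices
mi   <- mutinformation(toyd)        # features x features matrix

# The diagonal holds each feature's entropy; dividing by the smaller
# entropy of each pair scales the MI values to the range [0, 1].
# (A constant feature would have zero entropy and give NaN here.)
h   <- diag(mi)
nmi <- mi / outer(h, h, pmin)

# Greedy filter: keep a feature only if its normalised MI with every
# feature kept so far stays below the (assumed) threshold.
threshold <- 0.5
keep <- character(0)
for (f in colnames(nmi)) {
  if (all(nmi[f, keep] < threshold)) keep <- c(keep, f)
}

keep
filtered <- toy[, keep, drop = FALSE]  # data restricted to the kept features
```

This only removes features that are redundant with each other. If a group label per sample is available, an alternative is to rank each feature by its mutual information with that label and keep the top-scoring ones.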