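I have a large dataset of 20000 * 16 (rows * columns).
I am using R to draw a heatmap of this dataset, but with so many rows it does not seem feasible. I would like to remove the data points that show little or no variation, and so reduce the number of rows in the data matrix.
Can someone guide me on how to do this?
A sample of the dataset: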
Gene A B C D E F G H I J K L M N O
PQ1 7.3159 9.3802 10.77 8.701 13.6066 8.3253 9.0556 9.8801 9.0776 11.2029 7.61 10.8403 9.2378 12.1697 9.7482
PQ2 7.4715 5.2955 10.2275 6.3606 10.1463 5.9968 6.2673 8.6119 6.153 6.7903 4.0843 13.0875 6.8167 8.3186 6.7643
PQ3 0 0 0 0 0.0026 0 0 0 0 0 0 0 0 0 0.0037
PQ4 1.776 1.125 1.3508 1.2489 2.1252 2.1057 1.0177 1.6063 1.0053 0.9571 1.4972 1.3998 1.0935 2.4737 1.2063
PQ5 0.1024 0.092 0.0473 0.071 0.1227 0.2047 0.2481 0.1089 0.0499 0.1381 0.057 0.0953 0.0433 0.0651 0.0598
PQ6 5.4296 0.1688 2.4767 0.2507 0.5087 4.2835 2.2989 8.6027 3.1126 0.4565 0.167 2.9066 3.195 0.942 5.8904
PQ7 0.2918 11.5673 4.9554 0 1.6693 1.6301 0.4985 2.4444 0.6217 1.4638 3.2648 0.5773 3.1071 7.651 0.4068
PQ8 0 0 0 0 0.0575 0.1018 0 0.0422 0 0 0 0.0257 0.0276 0 0
PQ9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
PQ10 18.789 24.8681 29.8037 33.3986 37.8269 24.4719 21.1101 26.9985 21.9897 25.3416 26.77 23.1337 20.5193 27.0328 23.9777
PQ11 0 0.004 0 0 0.02 0 0 0.0265 0.0348 0 0.032 0 0.026 0 0.0167
PQ12 2.8442 4.7904 5.8717 3.2287 5.0917 1.5291 4.1187 6.207 3.532 5.4896 5.7066 5.1487 6.4386 11.2159 7.3469
PQ13 0 0.12 0.1776 0 0 0.0366 0.027 0 0 0 0.0462 0 0 0 0
PQ14 0 0 0 0 0 0 0.0136 0 0 0 0 0 0.0083 0 0
PQ15 0 0 0 0 0 0 0 0 0 0.0322 0 0 0 0 0
PQ16 0.0321 0.0469 0 0 0 0 0.0342 0 0 0 0 0 0 0 0
PQ17 0 0 0 0 0.0466 0.0225 0.0619 0.284 0.1252 0.0205 0 0.0371 0.1413 0.018 0.1238
PQ18 9.2029 12.3713 12.0135 7.7052 9.9121 7.3582 9.6782 12.5931 9.8137 12.4413 11.3418 14.4504 7.9965 8.5895 6.1705
PQ19 16.6408 9.1365 13.8613 12.6089 12.2094 16.5078 22.4689 16.4531 16.2172 15.6118 14.8256 18.5057 16.5483 13.5991 15.4934
PQ20 26.6048 24.1932 25.3238 27.9098 29.5022 25.4348 31.1095 30.4802 28.4243 21.2893 18.7577 27.3286 26.2074 30.6207 25.0771
PQ21 1.1114 0.067 0.3146 0.4593 0.3675 2.773 0.8119 0.5015 0.4696 0.5876 0.1406 0.2492 0.8565 0.2326 0.1521
PQ22 7.4962 5.3051 8.9577 6.1617 8.5887 8.2902 7.0983 7.1107 6.0231 6.9078 6.6685 7.3996 7.3611 8.3344 5.5536
PQ23 13.596 7.4782 9.6589 6.3121 10.7004 8.5035 9.769 10.1801 6.7358 5.0971 6.2171 9.713 7.0575 10.0523 7.5863
PQ24 18.564 35.9577 30.4134 27.9277 41.4544 23.1528 15.4656 32.0211 24.979 24.7365 41.5781 28.6164 34.8429 37.6385 27.1767
PQ25 15.5685 17.3154 17.0986 10.2068 13.5607 8.281 8.57 14.104 8.5732 8.3098 15.7368 18.3766 14.6625 14.2864 12.3646
PQ26 3.6639 5.5865 6.4437 2.7832 4.6902 6.4854 3.305 4.8913 3.0334 4.1835 5.9565 5.0441 4.4169 6.005 3.5551
PQ27 0.2116 0.0035 0.1193 0.0462 0.1113 0.3879 0.2976 0.9519 0.3039 0.0613 0.0478 0.5218 0.3197 0.1381 0.2277
PQ28 32.5026 28.1368 28.2335 25.6904 36.3761 26.779 36.4265 30.5154 35.1618 23.8327 27.087 24.7966 29.477 30.0189 26.1931
PQ29 1.8439 1.4574 1.2994 2.4006 0.6938 2.7233 0.6461 0.5976 1.7659 3.4405 1.5791 0.3336 1.8652 1.6685 2.0173
PQ30 1.7028 0.9633 2.0401 1.4563 1.4204 3.7509 1.843 2.071 2.3559 2.3659 1.2402 2.0673 2.2783 2.4221 1.3163
PQ31 0.1401 0.0283 0.3815 0.0434 0.1124 1.0891 0.0681 0.3404 0.2097 0.0552 0.1386 0.1835 0.2828 0.2267 0.2176
PQ32 3.1838 2.1398 4.1528 1.9499 3.0831 3.6193 3.0609 4.4113 2.4607 1.604 3.2404 4.4924 3.0917 4.525 3.0178
PQ33 0.0187 0.042 0.107 0 0.0162 0.0114 0.0366 0.0467 0.0532 0 0 0 0.0703 0.1173 0.0472
PQ34 1.3782 0.1604 0.3452 0.2124 0.0376 0.7386 0.4819 2.5638 0.3134 0.2188 1.6717 1.2121 0.4294 0.2202 0.2482
PQ35 0.0634 0.0294 0.0735 0.005 0.0558 0.1777 0.1734 0.0536 0.0259 0.0459 0.0217 0.0388 0.073 0.0206 0.074
PQ36 7.3565 4.5738 4.9642 1.8203 4.8537 12.1248 12.4298 8.541 11.8094 12.964 7.1189 17.0531 10.7116 6.5249 15.9312
PQ37 19.2056 16.5482 10.3252 26.8747 30.8489 26.5403 27.2519 12.1769 34.8122 26.1242 14.3651 12.6533 43.6538 24.7434 19.5469
PQ38 1.4191 5.2542 2.7084 4.6994 2.6367 3.0067 3.2322 3.8202 3.6902 3.6689 3.5244 1.3118 6.2961 3.4399 4.7755
PQ39 0.0032 0 0.0419 0 0.0001 0.0044 0 0.0259 0.0059 0 0 0 0.0203 0 0.0142
PQ40 5.5934 1.2258 7.0247 3.1928 3.7698 14.0234 2.3485 6.2129 4.1372 2.4574 3.8062 5.01 3.4968 4.6268 2.7763
PQ41 0.0664 0.0029 0.984 0 0.0448 0.3315 0.0709 0.5556 0.066 0.0443 0.1812 0.0918 0.1818 0.0491 0.2315
PQ42 12.4147 11.7431 20.2819 16.29 13.8172 16.5791 5.4218 11.46 15.264 26.2695 21.1681 14.128 16.8515 15.1775 11.4873
PQ43 0.0047 0.001 0.0731 0.0118 0.0169 0.207 0.0649 0.9764 0.0626 0.0002 0.0034 0.0657 0.3199 0.0003 0.2807
PQ44 0.135 0.0166 0.6497 0.0055 0.0229 0.1664 0.1529 0.4149 0.0361 0.0109 0.255 0.1788 0.1709 0.0291 0.3004
PQ45 56.8427 37.853 26.6238 10.5706 33.1238 45.9608 13.0512 17.1816 17.2876 12.7038 48.581 57.7831 20.1544 55.8307 17.7855
PQ46 0 0 0.0638 0 0 0 0 0 0 0 0 0 0 0 0
PQ47 0.3183 0.5558 0.9872 0.7507 0.963 0.9077 0.5323 2.3656 0.9466 0.8255 0.3479 1.184 1.8744 0.6751 0.3804
PQ48 0.0887 0.0237 0.5628 0.0256 0.3346 0.3528 0.1441 0.7293 0.2763 0.1582 0.0346 0.2104 0.3426 0.2687 0.152
The commands I am using are:
rpkm<-read.table("heatmap_table.txt", header=T)
row.names(rpkm)<-rpkm$Gene
rpkm<-rpkm[,2:16]
rpkm_matrix<-data.matrix(rpkm)
### somewhere here I need to put the variance filter.
heatmap(rpkm_matrix)
Thanks.
Answer 0 (score: 1)
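As Carl Witthoft pointed out in the comments, what you are about to do will change any inference you have made or will make from the dataset. The data you want to remove may be important.
That said, you need to be specific about what "little variation" actually means. As an example, if you want all rows whose variance is greater than, say, 0.0001, you can define a parameter and then use apply to get the rows whose variance exceeds it. In the first 20 rows of your data, this removes rows 3, 9, 14 and 15. It should also remove rows whose variance is NA.
Assuming your data frame is named dat: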
> getVar <- apply(dat[, -1], 1, var)
> param <- 1e-4
> dat[getVar > param & !is.na(getVar), ]
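The same filter can also be applied directly to the numeric matrix from the question, just before the heatmap call. The following is only a minimal sketch: the toy matrix stands in for rpkm_matrix and reuses the first five columns of rows PQ1, PQ3, PQ9 and PQ16 from the sample data, the names row_var, keep and filtered are made up for illustration, and the 1e-4 cutoff is the example value from above rather than a recommendation.

```r
# Toy stand-in for rpkm_matrix: rows are genes, columns are samples.
# Values are columns A-E of rows PQ1, PQ3, PQ9 and PQ16 in the sample data.
rpkm_matrix <- rbind(
  PQ1  = c(7.3159, 9.3802, 10.77, 8.701, 13.6066),
  PQ3  = c(0, 0, 0, 0, 0.0026),
  PQ9  = c(0, 0, 0, 0, 0),
  PQ16 = c(0.0321, 0.0469, 0, 0, 0)
)
colnames(rpkm_matrix) <- c("A", "B", "C", "D", "E")

# Per-row sample variance (one value per gene).
row_var <- apply(rpkm_matrix, 1, var)

# Example cutoff only; choose a value that suits your data.
param <- 1e-4

# Keep rows whose variance is defined and above the cutoff.
keep <- !is.na(row_var) & row_var > param

# drop = FALSE keeps the result a matrix even if only one row survives.
filtered <- rpkm_matrix[keep, , drop = FALSE]
```

With the real data, filtered would take the place of rpkm_matrix in the final call, i.e. heatmap(filtered). Note that heatmap needs at least two rows and two columns, so it is worth checking nrow(filtered) after filtering.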