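I'm working with the prostate gene expression data (http://icos.cs.nott.ac.uk/datasets/microarray.html) in R and trying to convert all entries to numeric so that I can write a similarity function. How can I convert all entries from factors to numeric values while keeping the expression values? If I index the data frame like this,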
> prostate[5,4]
[1] 3.17469778457247
2093 Levels: 0.133822364738809 ... normal
I only want the value 3.17...
Answer 0 (score: 1)
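The file has character data in its last line. When R reads it, every column becomes a factor, because a column that contains a non-numeric entry can no longer be read as numeric. In bash you can see this: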
tail -2 prostate_preprocessed.txt
AFFX-YEL021w/URA3_at 3.31255956783592 4.05800228545385 4.26348960812486 4.2180869800299 4.90599509636775 4.33488048792038 4.96535865133757 4.35350385526143 4.18529970123263 3.85103067777549 4.03836053811841 3.70345720098741 4.11379278781317 4.01121240340167 4.68296544299334 4.33584797205546 4.16864882878781 4.32781853396998 3.85145280458377 3.76586006943253 4.67388887037993 3.87182653639402 3.74997314075837 3.94258426954186 ...
tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor tumor
But you can fix it by reading only up to the second-to-last line (bash again, to count the lines):
wc -l prostate_preprocessed.txt
2136 prostate_preprocessed.txt
Now in R:
> prostate=read.table("prostate_preprocessed.txt", nrows=2135)
> prostate[4,5]
[1] 6.379761
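If you would rather not count lines in bash first, the same idea can be sketched entirely in R: read the raw lines, drop the last one, and parse the rest. The small temporary file below is made up for illustration and only mimics the layout described above; the gene names and values are not from the real data.

```r
# Build a small stand-in file (hypothetical data): numeric rows, then a label row.
tmp <- tempfile(fileext = ".txt")
writeLines(c("geneA 1.5 2.5 3.5",
             "geneB 4.0 5.0 6.0",
             "tumor normal tumor"), tmp)

# Read the raw lines and parse everything except the last one.
lines <- readLines(tmp)
expr <- read.table(text = lines[-length(lines)], stringsAsFactors = FALSE)

str(expr)    # V2..V4 are numeric because the label row was left out
expr[2, 3]   # 5
```

For the real file you would replace tmp with "prostate_preprocessed.txt". This reads the file into memory as text first, which should be fine for a couple of thousand lines.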
Edit: PS it is an odd file format, since you would probably want the tumor values in the last line to be the column headers:
> cn=read.table("prostate_preprocessed.txt", skip=2135, colClasses="character")
> colnames(prostate)<-cn[1,]
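If the data frame has already been read with factor columns, the usual R idiom is to convert through character first, because as.numeric() applied directly to a factor returns the internal level codes rather than the printed values. A minimal sketch on a made-up factor (the values are illustrative, not taken from the real file):

```r
# A factor that mixes numbers with a label, as happens when one row is text.
f <- factor(c("3.17", "0.13", "normal"))

codes <- as.numeric(f)   # internal level codes, not the expression values

# Convert via character; non-numeric entries such as "normal" become NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
vals <- suppressWarnings(as.numeric(as.character(f)))
vals                     # 3.17 0.13 NA
```

To apply this to every value column of a data frame, the same expression can be wrapped in lapply(), leaving out any identifier column such as the gene name. Re-reading the file without the last line, as above, is probably the cleaner fix, since it avoids the NA entries altogether.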