Which algorithm is suitable for genetic data with duplicated records?

Time: 2015-07-28 22:27:07

Tags: algorithm data-mining data-analysis genetics

My question is about finding the best algorithm for my data set.

I have data with three columns: individual, disease, and test score (there are 50 test score features, but only one is shown here). There are 3000 individuals. The disease feature can take the values disA, disB, and disC, whereas the test score is a discrete variable. The disease feature is my class attribute.

One individual can have up to three different diseases but only one test score value. My objective is to classify test scores by disease (that is, to find which test scores are associated with which disease). The problem is that if an individual has three diseases, all of that individual's test scores are repeated three times. For example, individual aa (with disA, disB, and disC) has a test score of 12, so the analysis file looks like this:

individuals, Disease, Test Score
aa,disA,12,...
aa,disB,12,...
aa,disC,12,...

This will result in a biased analysis. Is there a data mining algorithm or statistical test for this type of data? I cannot remove these patients because they make up the largest proportion of the data set.

2 answers:

Answer 0 (score: 0)

Why not recast the problem as a one-step mapping from test score to the set of diseases? Each individual then appears in exactly one row, so the test score is no longer repeated. Using your example, the first line of data below shows 'aa' as having all three diseases, while 'bb' has only disease A.

individuals, DiseaseA, DiseaseB, DiseaseC, Test Score
aa,true,true,true,12
bb,true,false,false,10
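
One way to get from the repeated rows in the question to this layout is sketched below in plain Python. It assumes the data is available as (individual, disease, test score) tuples; the function name `to_wide`, the `DISEASES` tuple, and the dictionary keys are illustrative choices, not part of any library.

```python
# Hypothetical helper: collapse repeated (individual, disease, score) rows
# into one record per individual with one boolean indicator per disease.
DISEASES = ("disA", "disB", "disC")  # assumed to be the full set of disease labels


def to_wide(rows):
    wide = {}  # individual -> record; dicts keep insertion order in Python 3.7+
    for individual, disease, score in rows:
        if disease not in DISEASES:
            raise ValueError(f"unknown disease label: {disease!r}")
        if individual not in wide:
            record = {"individual": individual}
            record.update({d: False for d in DISEASES})
            record["test_score"] = score
            wide[individual] = record
        record = wide[individual]
        # The question states each individual has a single score; fail loudly if not.
        if record["test_score"] != score:
            raise ValueError(f"conflicting test scores for {individual!r}")
        record[disease] = True
    return list(wide.values())


rows = [
    ("aa", "disA", 12),
    ("aa", "disB", 12),
    ("aa", "disC", 12),
    ("bb", "disA", 10),
]
wide_rows = to_wide(rows)
for record in wide_rows:
    print(record)
```

With one row per individual, the three disease indicators can be treated as a multi-label target, or each indicator can be modeled separately against the test scores.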

Answer 1 (score: 0)

I would use the following format, described by Hadley Wickham in the reshape package:

http://had.co.nz/reshape/

http://www.jstatsoft.org/v21/i12

Example:

individuals, variable, value
aa,disease,disA
aa,disease,disB
aa,disease,disC
aa,testscore,12
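
The answer refers to the R reshape package. As a rough illustration of the same "molten" layout, here is a sketch in plain Python rather than R. It assumes the input is a list of (individual, disease, test score) tuples as in the question, and `to_long` is a hypothetical helper name.

```python
# Hypothetical helper: turn repeated (individual, disease, score) rows into
# (individual, variable, value) rows, storing each test score only once.
def to_long(rows):
    diseases = {}  # individual -> list of diseases, in input order
    scores = {}    # individual -> first test score seen for that individual
    for individual, disease, score in rows:
        diseases.setdefault(individual, []).append(disease)
        scores.setdefault(individual, score)

    long_rows = []
    for individual, disease_list in diseases.items():
        for disease in disease_list:
            long_rows.append((individual, "disease", disease))
        long_rows.append((individual, "testscore", scores[individual]))
    return long_rows


rows = [("aa", "disA", 12), ("aa", "disB", 12), ("aa", "disC", 12)]
long_rows = to_long(rows)
for row in long_rows:
    print(",".join(str(field) for field in row))
```

In this layout the score for 'aa' appears in a single row, however many diseases that individual has.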