I have a dataset containing survey responses collected from the same people on multiple occasions. Below is a sample dataset.
library(dplyr)
DATA <- data.frame(ID = c(1,22,22,333,333,333,4444,4444,4444,4444),
                   Gender = c("M","F","F","M","M","NotAvailable","M","M","F","NotAvailable"),
                   MaritalStatus = c("W","M","M","UM","NotAvailable","UM","M","UM","W","NotAvaiable"),
                   Name = c("Available","NotAvailable","NotAvailable","Available","Available","Available","Available","NotAvailable",
                            "Available","NotAvailable"),
                   Age = c(20,30,30,21,22,23,33,33,33,34),
                   EmailIND = c(0,1,1,0,0,1,1,1,1,1),
                   Irrelevant = c(12,3123,312,343,554,66,67,56,123,434)
)
> DATA
     ID       Gender MaritalStatus         Name Age EmailIND Irrelevant
1     1            M             W    Available  20        0         12
2    22            F             M NotAvailable  30        1       3123
3    22            F             M NotAvailable  30        1        312
4   333            M            UM    Available  21        0        343
5   333            M  NotAvailable    Available  22        0        554
6   333 NotAvailable            UM    Available  23        1         66
7  4444            M             M    Available  33        1         67
8  4444            M            UM NotAvailable  33        1         56
9  4444            F             W    Available  33        1        123
10 4444 NotAvailable   NotAvaiable NotAvailable  34        1        434
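My goal is to create 2 variables:
Similarity Flag - 1 if the information an individual provided is the same in every survey, else 0.
Magnitude Similarity - a numeric score of how similar the information an individual provided is across surveys.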
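To make the two definitions concrete, here is a small worked example for ID 333. It assumes the score is the share of compared columns whose value never changes for that ID, which is how the solution in this post computes it. The names flags_333, magnitude_333 and flag_333 are hypothetical and used only for illustration.

```r
# Per-column "constant across surveys" flags for ID 333, read off the sample data:
# only Name ("Available" in all three rows) stays the same.
flags_333 <- c(Gender = 0, MaritalStatus = 0, Name = 1, Age = 0, EmailIND = 0)

# Magnitude Similarity: share of compared columns that never change.
magnitude_333 <- mean(flags_333)

# Similarity Flag: 1 only if every compared column is constant.
flag_333 <- as.numeric(magnitude_333 == 1)

print(magnitude_333)  # 0.2
print(flag_333)       # 0
```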
Below is my solution:
# Returns 1 if all values in x are identical, else 0
getSimRespFlag <- function(x){
  return(as.numeric(length(unique(x)) == 1))
}
Irrelevant stands for the columns that will not be used in this analysis.
numberOfCols <- ncol(DATA)
similarity_DATA <- DATA %>%
  select(-c(Irrelevant)) %>%
  group_by(ID) %>%
  # funs() is deprecated in current dplyr; passing the function directly is equivalent
  summarise_all(getSimRespFlag) %>%
  # columns 2:(numberOfCols-1) hold the per-column flags; their mean is the score
  mutate(SimilarResp_Flag = as.numeric((rowSums(.[2:(numberOfCols-1)])/(numberOfCols-2)) == 1),
         Magnitude_Similarity = rowSums(.[2:(numberOfCols-1)])/(numberOfCols-2)) %>%
  select(ID, SimilarResp_Flag, Magnitude_Similarity)
> similarity_DATA
# A tibble: 4 × 3
     ID SimilarResp_Flag Magnitude_Similarity
  <dbl>            <dbl>                <dbl>
1     1                1                  1.0
2    22                1                  1.0
3   333                0                  0.2
4  4444                0                  0.2
DATA <- left_join(DATA,similarity_DATA,by ="ID")
> DATA
     ID       Gender MaritalStatus         Name Age EmailIND Irrelevant SimilarResp_Flag Magnitude_Similarity
1     1            M             W    Available  20        0         12                1                  1.0
2    22            F             M NotAvailable  30        1       3123                1                  1.0
3    22            F             M NotAvailable  30        1        312                1                  1.0
4   333            M            UM    Available  21        0        343                0                  0.2
5   333            M  NotAvailable    Available  22        0        554                0                  0.2
6   333 NotAvailable            UM    Available  23        1         66                0                  0.2
7  4444            M             M    Available  33        1         67                0                  0.2
8  4444            M            UM NotAvailable  33        1         56                0                  0.2
9  4444            F             W    Available  33        1        123                0                  0.2
10 4444 NotAvailable   NotAvaiable NotAvailable  34        1        434                0                  0.2
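Is there a better way to get the similarity flag and the similarity magnitude, something like cosine similarity for documents (which works on numeric values, whereas I have both categorical and numeric data)? My dataset is large and this operation takes time, so any faster solution would also be welcome.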
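One possible direction, offered as a sketch rather than a tested answer: the same per-ID computation can be written with across() and n_distinct() from dplyr, selecting the compared columns by name instead of by position. This assumes dplyr 1.0 or later. The names survey, compare_cols, flags and similarity are hypothetical, and the data is simply the sample from this post repeated so the block runs on its own.

```r
library(dplyr)

# Same sample data as in the question, under the hypothetical name `survey`.
survey <- data.frame(
  ID = c(1, 22, 22, 333, 333, 333, 4444, 4444, 4444, 4444),
  Gender = c("M", "F", "F", "M", "M", "NotAvailable", "M", "M", "F", "NotAvailable"),
  MaritalStatus = c("W", "M", "M", "UM", "NotAvailable", "UM", "M", "UM", "W", "NotAvaiable"),
  Name = c("Available", "NotAvailable", "NotAvailable", "Available", "Available",
           "Available", "Available", "NotAvailable", "Available", "NotAvailable"),
  Age = c(20, 30, 30, 21, 22, 23, 33, 33, 33, 34),
  EmailIND = c(0, 1, 1, 0, 0, 1, 1, 1, 1, 1),
  Irrelevant = c(12, 3123, 312, 343, 554, 66, 67, 56, 123, 434)
)

# Columns to compare, chosen by name so that reordering columns cannot break the result.
compare_cols <- setdiff(names(survey), c("ID", "Irrelevant"))

# One row per ID; each compared column becomes 1 if it holds a single distinct value.
flags <- survey %>%
  group_by(ID) %>%
  summarise(across(all_of(compare_cols), ~ as.numeric(n_distinct(.x) == 1)),
            .groups = "drop")

# Share of constant columns, and a flag that is 1 only when every column is constant.
flags$Magnitude_Similarity <- unname(rowMeans(flags[compare_cols]))
flags$SimilarResp_Flag <- as.numeric(flags$Magnitude_Similarity == 1)

similarity <- flags[c("ID", "SimilarResp_Flag", "Magnitude_Similarity")]
print(similarity)
```

On the sample data this gives the same flags and scores as the solution above (1 and 1.0 for IDs 1 and 22, 0 and 0.2 for IDs 333 and 4444). It has not been benchmarked here. For very large data, the same distinct-count-per-group idea could also be tried with data.table, for example its uniqueN function.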