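I have a data frame containing pairs of elements found in a number of datasets. The order within a pair does not matter: pairs are given once, in alphabetical order, but which element comes first can vary between databases, as in the example.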
data <- data.frame(i = c("b","b","b","c"), j = c("c","d","d","a"),
                   database = c(1,1,2,3))
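I would like to generate a score for each pair that shows the ratio of databases containing that same pair.
I can imagine a crude function along these lines: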
# For each database that includes particular i or j, test whether
# they have a connection to another particular element at j or i,
# respectively. Count the number of successes.
# Divide it by:
# Count(number of databases that contain either of the members of the pair in i or j)
The results I expect from the example dataset (order is not important) are:
a c 0.5
b c 0.33
b d 1
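Read as a formula, the rough procedure above seems to amount to the following. The notation is introduced here for illustration only and is not part of the original post: P_d is assumed to be the set of unordered pairs listed in database d, and E_d the set of elements that appear in database d in either column.

```latex
\[
\mathrm{score}(a,b) \;=\;
\frac{\bigl|\{\, d : \{a,b\} \in P_d \,\}\bigr|}
     {\bigl|\{\, d : a \in E_d \ \text{or}\ b \in E_d \,\}\bigr|}
\]
\[
\mathrm{score}(a,c) = \tfrac{1}{2}, \qquad
\mathrm{score}(b,c) = \tfrac{1}{3} \approx 0.33, \qquad
\mathrm{score}(b,d) = \tfrac{2}{2} = 1
\]
```

For example, the pair b, c occurs only in database 1, while b or c occurs in databases 1, 2 and 3, which gives 1/3.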
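I can see how a crude loop like this would work, but I am sure there is a more elegant solution. Can anyone help? Perhaps there is a dedicated function in a graph library. Thanks!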
Answer 0: (score: 1)
Just playing around a bit with joins (i.e. merges):
library(dplyr)

data <- data.frame(i = c("b","b","b","c"), j = c("c","d","d","a"),
                   database = c(1,1,2,3), stringsAsFactors = FALSE)

# Sort each pair lexicographically and count the databases that contain it
data2 <- mutate(data, x = pmin(i, j), y = pmax(i, j))
pairs_all <- summarize(group_by(data2, x, y), n_all = length(unique(database)))

# Introduce a helper index to identify the pairs (for the following joins)
pairs_all$pair_id <- seq_len(nrow(pairs_all))

# Count the databases that contain either element of each pair.
# Note: x is matched only against first (sorted) positions and y only
# against second positions, which is sufficient for the example data.
r <- merge(pairs_all,
           summarize(group_by(merge(merge(pairs_all,
                                          transmute(data2, x, db1 = database)),
                                    transmute(data2, y, db2 = database)),
                              pair_id),
                     n_any = length(unique(union(db1, db2)))))

# Finally calculate the result
transmute(r, x, y, score = n_all / n_any)
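The joins above look up the first sorted element only in first positions and the second only in second positions. That is enough for the example, but an element could in principle sit in the other position of a different pair. A minimal base R sketch that checks both columns might look like this; `dbs_with` is a hypothetical helper named here for illustration, and this is only one possible way to write it, not the answerer's method.

```r
# Example data from the question (strings kept as character vectors)
data <- data.frame(i = c("b","b","b","c"), j = c("c","d","d","a"),
                   database = c(1,1,2,3), stringsAsFactors = FALSE)

# Order-independent representation of every row's pair
x_all <- pmin(data$i, data$j)
y_all <- pmax(data$i, data$j)

# Hypothetical helper: databases in which element `e` appears in either column
dbs_with <- function(e) unique(data$database[data$i == e | data$j == e])

# One row per unordered pair
pairs <- unique(data.frame(x = x_all, y = y_all, stringsAsFactors = FALSE))

# Score = databases containing the pair / databases containing either element
pairs$score <- mapply(function(x, y) {
  both   <- unique(data$database[x_all == x & y_all == y])
  either <- union(dbs_with(x), dbs_with(y))
  length(both) / length(either)
}, pairs$x, pairs$y, USE.NAMES = FALSE)

print(pairs)
```

On the example data this should reproduce the scores the question asks for (1/3 for b, c; 1 for b, d; 0.5 for a, c), with the pairs listed in order of first appearance.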
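Answer 1: (score: 0)
Wow, that was rough! But I have written the hack I mentioned above, for anyone who stumbles over a similarly obscure ad hoc network comparison in the future. If anyone still knows of a reference that could simplify this, or of a more robust way to find natural groups of this kind among pairs of network nodes, please let me know. :)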
# Calculate the score one row at a time
data$score <- NA_real_
for (linenr in seq_len(nrow(data))) {
  count_pair <- 0
  count_one <- 0
  pair <- c(as.character(data$i[linenr]), as.character(data$j[linenr]))
  # Loop through datasets; unique() is used because levels() returns NULL
  # when the database column is numeric, as in the example
  for (setname in unique(data$database)) {
    db_rows <- subset(data, database == setname)
    # Test whether either element appears in the dataset
    if (any(pair %in% c(as.character(db_rows$i), as.character(db_rows$j)))) {
      count_one <- count_one + 1
    }
    for (line2nr in seq_len(nrow(db_rows))) {
      # Test whether the dataset contains a line with both elements of the original pair
      if (sum(pair %in% c(as.character(db_rows$i[line2nr]),
                          as.character(db_rows$j[line2nr]))) == 2) {
        count_pair <- count_pair + 1
      }
    }
  }
  # Simple ratio calculation
  data$score[linenr] <- count_pair / count_one
}
frame <- data.frame(i = data$i, j = data$j, score = data$score)
# Remove database duplicates
result <- frame[!duplicated(frame), ]
# This still doesn't deal with changed-order duplicates, but does the job for now.
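The remaining changed-order duplicates could probably be handled by deduplicating on an order-independent key instead of on the raw columns. The following is a minimal sketch under assumptions: the small `frame` below is made-up illustration data rather than output of the loop, and `key` is a name introduced here.

```r
# Made-up illustration data: rows 1 and 2 are the same pair in swapped order
frame <- data.frame(i = c("b", "c", "b"),
                    j = c("c", "b", "d"),
                    score = c(0.5, 0.5, 1),
                    stringsAsFactors = FALSE)

# Build a key that is identical for (b, c) and (c, b)
key <- paste(pmin(frame$i, frame$j), pmax(frame$i, frame$j))

# Keep only the first row for each unordered pair
result <- frame[!duplicated(key), ]
print(result)
```

Here `pmin` and `pmax` put the two elements of each row into alphabetical order, so swapped pairs map to the same key and only the first occurrence is kept.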