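I have been trying to use gsub to replace the identifier keys in an edge list with simple integers. The edge list consists of individuals and a list of their connections (of variable length per individual). Unfortunately, because my dataset has more than 300K rows (and therefore needs 300K+ search-and-replace operations), running this iteratively would take nearly a week to finish. The current code produces the desired output, but I am looking for a more efficient approach. Does anyone know a better way to produce similar output? My current code, some hypothetical sample data, and sample output are below:
Sample data: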
Person Connection_list
ENJAK IDFJA, FDAKD, AODMK
JBJDF KJDFA
LAFMD JBJDF, KAOJD, ENJAK,FKJSE,IDFJA, AKSKE, FNAFJ, KJDFA, ATNFN, ADOFA, ODIJA, AODMK, NAGJA, NFAKD, FDAKD, KDSFN
ADOFA JDFKA, KAOJD, NAGJA
KJDFA ENJAK, ATNFN, NFAKD, ADOFA, AODMK, JDFKA, LAFMD, ODIJA, FNAFJ, KDSFN, JBJDF, FJKAS, FKJSE, AKSKE, NAGJA
IDFJA AKSKE, KJDFA, FJKAS, ADOFA
KDSFN KAOJD, ADOFA, AKSKE, FDAKD, NFAKD, FKJSE, NAGJA, JDFKA, ODIJA, FJKAS, ATNFN, JBJDF, FNAFJ, KJDFA, LAFMD, ENJAK
AKSKE ADOFA, ODIJA, KAOJD, JBJDF, ENJAK, AODMK, FDAKD, IDFJA, NAGJA, KJDFA
NAGJA KAOJD, AKSKE
ODIJA ADOFA, FDAKD, FKJSE, ATNFN, IDFJA, NAGJA, KAOJD
FKJSE JBJDF, NAGJA, KDSFN, KAOJD, LAFMD, KJDFA, NFAKD, FDAKD, ENJAK, ATNFN, FNAFJ, ODIJA, ADOFA, AODMK, FJKAS, AKSKE, IDFJA
FDAKD ADOFA, ODIJA, FKJSE, NAGJA, NFAKD, KJDFA, JBJDF, ATNFN, AODMK, AKSKE, KDSFN, JDFKA, LAFMD
NFAKD ADOFA, KJDFA, AKSKE, KDSFN, FJKAS, JBJDF, JDFKA
FJKAS FKJSE, AKSKE, FDAKD, NAGJA, ADOFA, ENJAK, FNAFJ, KDSFN, NFAKD, ATNFN, AODMK, KAOJD, JBJDF, JDFKA, LAFMD, IDFJA
JDFKA AKSKE, KJDFA, IDFJA
ATNFN AODMK, IDFJA, AKSKE
KAOJD ENJAK, FJKAS, FKJSE, AKSKE, NFAKD, LAFMD, JDFKA, KDSFN, ODIJA
AODMK AKSKE, FNAFJ, KAOJD, JDFKA, LAFMD, FDAKD, KDSFN, ENJAK, FJKAS, JBJDF, FKJSE, IDFJA, ATNFN
FNAFJ JBJDF, ADOFA, NFAKD, ODIJA, KAOJD, FKJSE, LAFMD, AKSKE, KDSFN, IDFJA, FNAFJ, ENJAK
Current code:
for (i in 1:dim(data)[1]){
  data$key[i] <- i
  data[,2] <- gsub(data[i,1],as.character(i),data[,2])
}
Desired/current output:
key Person Connection_list
1 ENJAK 6,12,1,18
2 JBJDF 5
3 LAFMD 2,17,3,1,11,6,8,19,5,16,4,10,18,9,13,12,7
4 ADOFA 15,17,9,4
5 KJDFA 1,5,16,13,4,18,15,3,10,19,7,2,14,11,8,9
6 IDFJA 8,5,14,4,6
7 KDSFN 17,4,8,12,13,11,9,15,7,10,14,16,2,19,5,3,1
8 AKSKE 4,10,17,2,1,18,12,6,9,5
9 NAGJA 17,8
10 ODIJA 4,12,11,16,6,9,17
11 FKJSE 2,9,7,17,3,5,13,12,11,1,16,19,10,4,18,14,8,6
12 FDAKD 4,10,11,9,12,13,5,2,16,18,8,7,15,3
13 NFAKD 4,5,8,7,14,2,15
14 FJKAS 11,8,14,12,9,4,1,19,7,13,16,18,17,2,15,3,6
15 JDFKA 8,5,15,6
16 ATNFN 16,18,6,8
17 KAOJD 1,14,11,8,13,3,15,7,10
18 AODMK 8,19,17,15,3,12,7,1,14,2,11,6,16,18
19 FNAFJ 2,4,13,10,17,11,3,8,7,6,19,1
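Answer 0 (score: 0)
This is not explicit code that solves your problem, but rather the strategy I would use.
If I read it correctly, Person is a unique identifier and Connection_list holds the edges between persons. If your values are factors and you need numeric values further along in your analysis pipeline, you can use the factors' integer codes, converted explicitly to integer.
First, I would split Connection_list into multiple columns, e.g. as described in: Split column into multiple columns R.
Then, once your columns are recognized as containing factor values,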
aframe2 <- as.data.frame(lapply(aframe1, factor))
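you should be able to retrieve numeric values from these factors with something along the lines of as.integer(f).
(as.numeric(as.character(f)) only applies when the factor labels are themselves numeric strings; for letter identifiers like the ones here it returns NA.)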
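A minimal sketch of this strategy, assuming the question's column names (Person, Connection_list) and a small made-up data frame. It departs from the per-column factor() call in one respect: every split value is converted against a single shared level set (the Person column), so the same identifier gets the same integer everywhere and that integer is the person's row number. It assumes Person is unique; identifiers that never appear in Person would come out as NA.

```r
# Hypothetical three-row sample in the question's layout
aframe1 <- data.frame(
  Person = c("ENJAK", "JBJDF", "LAFMD"),
  Connection_list = c("JBJDF, LAFMD", "ENJAK", "ENJAK,JBJDF"),
  stringsAsFactors = FALSE
)

# Split each connection string on commas, tolerating optional spaces
parts <- strsplit(aframe1$Connection_list, " *, *")

# Shared levels: the integer code of an identifier is its row in Person
codes <- lapply(parts, function(p) {
  as.integer(factor(p, levels = aframe1$Person))
})

# Add the key column and collapse the codes back into one string per row
aframe1$key <- seq_len(nrow(aframe1))
aframe1$Connection_list <- vapply(codes, paste, character(1), collapse = ",")
print(aframe1)
```

Because each row is split and converted once, the work grows with the total number of connections rather than with rows times rows, which is where the gsub loop in the question spends its time.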
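Answer 1 (score: 0)
I ended up solving the problem in a roundabout way. Because each user has a friends list of a different length, I used the dplyr package to process the data row by row and applied a split function (using the stringr package) to each row, producing a "long" edge list. Then, after converting the friend identifiers to their factor equivalents, I recombined the results back into the original format. The recombination code is quite messy and I am sure there is a more efficient way, but the code looks like this: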
library(dplyr)
library(stringr)
# User defined split function: one input row -> one output row per friend
longedge <- function(df){
  user <- df$user_id
  cnx <- df$friends
  # An empty friends string becomes a single NA row
  split <- as.data.frame(ifelse(cnx=="",NA,str_split(cnx,", ")))
  combine <- as.data.frame(cbind(user,split),stringsAsFactors=FALSE)
  colnames(combine) <- c("user_id", "friend")
  return(combine)
}
# Creating long edgelist
edgelist <- edgelist %>%
  rowwise() %>%
  do(longedge(.)) %>%
  rbind()
# Convert to number
edgelist$friend <- as.numeric(as.factor(edgelist$friend))
# Create count of No. of connections
edgelist1 <- edgelist %>%
  group_by(user_id) %>%
  summarize(friend_count=n())
# Recreate 'wide' connection list
# NOTE: j walks the long edgelist in row order, so this assumes its rows are
# sorted by user_id, which is the order in which summarize() returns the groups
friend_list <- rep(NA,dim(edgelist1)[1])
for (i in 1:dim(edgelist1)[1]){
  if(i==1){j<-1}
  x <- j + edgelist1$friend_count[i]
  friend_list[i] <- as.character(edgelist$friend[j])
  j <- j+1
  while(j < x ){
    friend_list[i] <- paste(friend_list[i],edgelist$friend[j],sep=", ")
    j <- j+1
  }
}
# Recombine
edgelist1 <- cbind(edgelist1,friend_list)
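For comparison, the split, convert, and recombine steps can probably be done without the explicit loop. The following is a sketch in base R, not the code used above: it assumes the same user_id and friends column names, uses a small made-up edgelist, and swaps in match() for the factor conversion, so each friend is numbered by its row position in user_id rather than alphabetically. Friends that do not appear in user_id would get NA.

```r
# Hypothetical sample with the same column names as the code above
edgelist <- data.frame(
  user_id = c("ENJAK", "JBJDF", "LAFMD", "ADOFA"),
  friends = c("JBJDF, LAFMD", "ENJAK", "", "ENJAK, ADOFA, JBJDF"),
  stringsAsFactors = FALSE
)

# Split every friends string once; an empty string yields character(0)
friend_ids <- strsplit(edgelist$friends, ", ", fixed = TRUE)

# One vectorised lookup over all friends of all users
flat <- unlist(friend_ids)
keys <- match(flat, edgelist$user_id)

# Regroup the keys by originating row (keeping rows with no friends)
row_of <- rep(seq_along(friend_ids), lengths(friend_ids))
grouped <- split(keys, factor(row_of, levels = seq_along(friend_ids)))

# Collapse each group back into a comma-separated string
edgelist$friend_list <- unname(
  vapply(grouped, paste, character(1), collapse = ", ")
)
edgelist$key <- seq_len(nrow(edgelist))
print(edgelist)
```

The design choice here is to do a single match() call on the flattened vector, so the lookup cost does not depend on how the friends are distributed across rows; users with an empty friends string end up with an empty friend_list entry rather than NA.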