My question is about text mining and text processing. I want to build a co-occurrence matrix from my data. My data is:
dat <- read.table(text="id_reférence id_paper
621107 621100
621100 621101
621107 621102
621109 621103
621105 621104
621103 621105
621109 621106
621106 621107
621107 621108
621106 621109", header = TRUE)
expected <- matrix(0,10,10)
### Article 1 has been cited by article 2
expected[2, 1] <- 1
Thanks in advance :)
Answer 0 (score: 1)
# loop through the observations of dat
for (i in seq_len(nrow(dat))) {
  # convert reference ids to integer and store in a vector
  # (the example data requires this step; your actual data may already hold integers)
  ref <- as.integer(strsplit(as.character(dat$id_reférence[i]), ",")[[1]])
  # loop through the list of references
  for (j in ref) {
    # mark each citation at (row, column) = (id_paper, reference id);
    # indexing by id assumes the ids are positions within 1..nrow(expected)
    expected[dat$id_paper[i], j] <- 1
  }
}
expected
#       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#  [1,]    0    1    0    0    0    0    0    0    0     0
#  [2,]    0    0    0    1    0    0    0    1    0     0
#  [3,]    1    0    0    0    1    0    0    0    0     0
#  [4,]    0    0    0    0    0    0    0    1    0     0
#  [5,]    0    0    0    1    1    0    0    0    1     0
#  [6,]    0    0    1    0    0    0    0    1    0     0
#  [7,]    0    1    0    1    0    0    0    0    0     0
#  [8,]    0    0    0    0    0    1    0    0    1     0
#  [9,]    0    0    0    0    0    0    0    0    0     1
# [10,]    1    0    0    1    0    0    0    0    1     0
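If the ids are large values such as 621100 rather than 1 to 10, they cannot be used directly as row and column positions in a 10 x 10 matrix. A minimal base R sketch, assuming each row of the data is one (reference, paper) pair as in the question, maps every id to a position with match() first. The ASCII column name id_reference and the object names ids and adj are assumptions made for this illustration, not part of the original answer.

```r
# Question data, with an ASCII column name (an assumption, to avoid
# locale-dependent handling of the accented header)
dat <- read.table(text = "id_reference id_paper
621107 621100
621100 621101
621107 621102
621109 621103
621105 621104
621103 621105
621109 621106
621106 621107
621107 621108
621106 621109", header = TRUE)

# every id that appears in either column, in a fixed order
ids <- sort(unique(c(dat$id_reference, dat$id_paper)))

# square matrix labelled by id: rows are citing papers, columns are cited references
adj <- matrix(0L, nrow = length(ids), ncol = length(ids),
              dimnames = list(as.character(ids), as.character(ids)))

# translate ids to positions and fill all (row, column) pairs in one step
adj[cbind(match(dat$id_paper, ids), match(dat$id_reference, ids))] <- 1L

adj
```

Rows are citing papers and columns are cited references, so adj["621101", "621100"] is 1 because paper 621101 lists 621100 as a reference. Indexing with a two-column matrix replaces the nested loop, which may matter for larger data sets.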
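Answer 1 (score: 0)
Here is another approach using data.table. A possible bottleneck is that the approach below does not end up with a sparseMatrix. Depending on the size of your data set, it may be worth looking into an approach that targets sparse data objects.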
library(data.table)
setDT(dat)
# split id_reférence column into multiple rows by comma
# code for this step taken from:
# https://stackoverflow.com/questions/13773770/split-comma-separated-strings-in-a-column-into-separate-rows
dat = dat[, strsplit(as.character(id_reférence), ",", fixed = TRUE),
          by = .(id_paper, id_reférence)][, id_reférence := NULL][
          , setnames(.SD, "V1", "id_reférence")]
# add value column for casting
dat[, cite := 1]
# cast your data into wide format (one column per referenced id)
dat = dcast(dat, id_paper ~ id_reférence, value.var = "cite", fill = 0)[, id_paper := NULL]
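The concern about not ending up with a sparse object can be addressed by building the matrix from (row, column) index pairs directly. The following is a sketch only, assuming the Matrix package is installed (it ships with standard R distributions as a recommended package) and that each row of the data is one (reference, paper) pair. The ASCII column name id_reference and the names ids and sp are illustrative assumptions.

```r
library(Matrix)

# Question data, with an ASCII column name (an assumption for this sketch)
dat <- read.table(text = "id_reference id_paper
621107 621100
621100 621101
621107 621102
621109 621103
621105 621104
621103 621105
621109 621106
621106 621107
621107 621108
621106 621109", header = TRUE)

# every id that appears in either column, in a fixed order
ids <- sort(unique(c(dat$id_reference, dat$id_paper)))

# only the non-zero cells are stored: rows are citing papers, columns are cited references
sp <- sparseMatrix(
  i = match(dat$id_paper, ids),       # row position of each citing paper
  j = match(dat$id_reference, ids),   # column position of each cited reference
  x = 1,
  dims = c(length(ids), length(ids)),
  dimnames = list(as.character(ids), as.character(ids))
)

sp
```

Unlike the dcast() result, this matrix is square even when some ids never appear as a reference. Note that sparseMatrix() sums the x values of duplicated (i, j) pairs, so a citation listed twice would yield 2 rather than 1.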