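I have a large data frame of 305k rows with two key columns and one value column (a sample is included further down).
I am trying to convert it to a sparse matrix in R with the following code: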
#convert to factors
data$RID = as.factor(data$RID)
data$HID = as.factor(data$HID)
data$VALUE = as.numeric(data$VALUE)
str(data)
#remove nas
data = na.omit(data)
#create sparse matrix
X = with(data, sparseMatrix(i=RID,
                            j=HID,
                            x=VALUE,
                            dimnames=list(levels(RID), levels(HID))))
This produces the following error message:
Error in sparseMatrix(i = RID, j = HID, x = VALUE, dimnames = list(levels(RID), :
NA's in (i,j) are not allowed
In addition: Warning messages:
1: In Ops.factor(i, !(m.i || i1)) : ‘+’ not meaningful for factors
2: In Ops.factor(j, !(m.j || i1)) : ‘+’ not meaningful for factors
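I have already removed the NAs, so I am not sure why it complains about NAs. It also mentions '+' for factors, but I checked all 36k factor levels and none of them contains a '+'.
Does anyone know what the solution is?
I have included a snapshot of the first 20 rows of the data below so you can reproduce the problem: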
"RID" "HID" "VALUE"
"361838" "620631" 76.55
"361838" "620671" 82.61
"361838" "620787" 57.73
"361838" "621146" 58.65
"361838" "637825" 64.15
"361838" "637859" 82.79
"361838" "641254" 50.38
"361838" "642105" 72.88
"361838" "646469" 45.79
"361838" "648400" 82.06
"395855" "301340" -5.12
"395855" "649304" 41.88
"395855" "650324" -30.83
"395855" "657458" 46.47
"395855" "658028" -0.53
"395855" "659504" 28.84
"395855" "660506" 29.03
"395855" "660519" 14.16
"395855" "660521" -38.17
"395855" "660547" 35.45
However, when I look at the factors, I get the following:
> str(data)
'data.frame': 20 obs. of 3 variables:
$ RID : Factor w/ 30608 levels "361838","395855",..: 1 1 1 1 1 1 1 1 1 1 ...
$ HID : Factor w/ 37399 levels "2018","7990",..: 11604 11624 11709 11740 14031 14049 15086 15457 16821 17270 ...
$ VALUE: num 76.5 82.6 57.7 58.6 64.2 ...
Answer 0 (score: 1)
Try converting RID and HID to numeric when calling sparseMatrix:
X <- with(data, sparseMatrix(i=as.numeric(RID),
                             j=as.numeric(HID),
                             x=as.numeric(VALUE),
                             dimnames=list(levels(RID), levels(HID))))
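RID and HID first need to be converted to factors, and then to numeric when calling sparseMatrix, because as.numeric() on a factor returns its integer level codes (1, 2, ...). Without the factor step, sparseMatrix would take the raw values of RID and HID as the row/column indices. In other words,
test <- data.frame(x = 101:105, y = 201:205, v = 1:25)
dim(with(test, sparseMatrix(i = x, j = y, x = v)))
# [1] 105 205
gives you a 105 x 205 matrix, even though what we had in mind was to treat x and y as keys, in which case it would be just a 5 x 5 matrix.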
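For comparison, here is a minimal sketch of the factor-based approach on toy data. The data frame `keyed` and its columns are made up for illustration (keys 101..105 and 201..205, one value per key pair), and it assumes the Matrix package is installed. Going through factor level codes should give the compact 5 x 5 matrix, with the original keys kept as dimnames:

```r
library(Matrix)

# Hypothetical toy data: every (x, y) key pair appears exactly once.
keyed <- data.frame(x = rep(101:105, times = 5),
                    y = rep(201:205, each = 5),
                    v = 1:25)

# Treat the keys as factors so each distinct key gets a level code 1..5.
keyed$x <- as.factor(keyed$x)
keyed$y <- as.factor(keyed$y)

# as.integer() on a factor returns the level codes, which are valid indices.
m <- with(keyed, sparseMatrix(i = as.integer(x),
                              j = as.integer(y),
                              x = as.numeric(v),
                              dimnames = list(levels(x), levels(y))))

dim(m)
# [1] 5 5
```

Entries can then be looked up by key, for example m["101", "201"].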