R - How to convert a long-format data frame to a sparse matrix

Asked: 2017-03-12 17:21:10

Tags: r matrix dataframe sparse-matrix reshape

I have a large data frame (305k rows) with two key columns, RID and HID, and one data column, VALUE.

I am trying to convert it to a sparse matrix with the following R code:

#convert to factors
data$RID   = as.factor(data$RID)
data$HID   = as.factor(data$HID)
data$VALUE = as.numeric(data$VALUE)
str(data)

#remove nas
data = na.omit(data)

#create sparse matrix
X = with(data,sparseMatrix(i=RID, 
                           j=HID, 
                           x=VALUE,
                           dimnames=list(levels(RID), levels(HID))))

This produces the following error message:

Error in sparseMatrix(i = RID, j = HID, x = VALUE, dimnames = list(levels(RID),  : 
  NA's in (i,j) are not allowed
In addition: Warning messages:
1: In Ops.factor(i, !(m.i || i1)) : ‘+’ not meaningful for factors
2: In Ops.factor(j, !(m.j || i1)) : ‘+’ not meaningful for factors

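I have already removed the NAs, so I am not sure why the error complains about NAs. It also mentions '+' in connection with factors, but I checked all 36k factor levels and none of them contains a '+'.

Does anyone know what the solution is?

I have included the first 20 rows of the data below so you can reproduce the problem: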

"RID" "HID" "VALUE"
"361838" "620631" 76.55
"361838" "620671" 82.61
"361838" "620787" 57.73
"361838" "621146" 58.65
"361838" "637825" 64.15
"361838" "637859" 82.79
"361838" "641254" 50.38
"361838" "642105" 72.88
"361838" "646469" 45.79
"361838" "648400" 82.06
"395855" "301340" -5.12
"395855" "649304" 41.88
"395855" "650324" -30.83
"395855" "657458" 46.47
"395855" "658028" -0.53
"395855" "659504" 28.84
"395855" "660506" 29.03
"395855" "660519" 14.16
"395855" "660521" -38.17
"395855" "660547" 35.45

When I inspect the factors on this subset, though, I get the following:

> str(data)
'data.frame':   20 obs. of  3 variables:
 $ RID  : Factor w/ 30608 levels "361838","395855",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ HID  : Factor w/ 37399 levels "2018","7990",..: 11604 11624 11709 11740 14031 14049 15086 15457 16821 17270 ...
 $ VALUE: num  76.5 82.6 57.7 58.6 64.2 ...

1 Answer:

Answer 0: (score: 1)

Try converting RID and HID to numeric when calling sparseMatrix:

X <- with(data, sparseMatrix(i=as.numeric(RID), 
                       j=as.numeric(HID), 
                       x=as.numeric(VALUE),
                       dimnames=list(levels(RID), levels(HID))))
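The warnings in the question suggest that sparseMatrix performs arithmetic on i and j, which is not defined for factors. The following base R sketch illustrates that behavior on a toy factor (the name `rid` and its three values are chosen here for illustration, reusing two keys from the sample data); it is not a trace of what sparseMatrix does internally.

```r
# Toy keys standing in for RID; values borrowed from the question's sample.
rid <- factor(c("361838", "395855", "361838"))

# Arithmetic is not defined for factors: R warns that '+' is not
# meaningful for factors and returns NA for every element.
shifted <- suppressWarnings(rid + 1L)
print(shifted)

# as.integer() (like as.numeric()) returns the level codes,
# not the numbers spelled out in the labels.
codes <- as.integer(rid)
print(codes)

# The original keys are still available as the level labels.
print(levels(rid)[codes])
```

This is why the error mentions NAs even after na.omit: the NAs come from the arithmetic on the factor, not from the data. It also shows that '+' in the warning refers to the addition operator, not to a character in the factor levels.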

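The reason RID and HID first need to be converted to factors, and then to numeric when calling sparseMatrix, is that otherwise sparseMatrix would take the values of RID and HID themselves as the row/column indices. In other words,

test <- data.frame(x = 101:105, y = 201:205, v = 1:25)
dim(with(test, sparseMatrix(i = x, j = y, x = v)))
# [1] 105 205

gives you a 105 x 205 matrix, even though what we have in mind, treating x and y as keys, is just a 5 x 5 matrix.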
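For comparison, here is a minimal sketch of the factor-then-integer route on the same kind of toy data, assuming the Matrix package is installed. The names `fx`, `fy` and `m` are illustrative, and passing `dims` explicitly is a precaution added here so that the dimensions always match the dimnames, even if some factor levels are unused.

```r
library(Matrix)

# Five distinct keys per axis, one value per (x, y) pair.
test <- data.frame(x = 101:105, y = 201:205, v = 1:5)

# Convert the keys to factors so each distinct key gets a code 1..nlevels.
fx <- factor(test$x)
fy <- factor(test$y)

# Use the integer codes as indices and the level labels as dimnames.
m <- sparseMatrix(i = as.integer(fx),
                  j = as.integer(fy),
                  x = as.numeric(test$v),
                  dims = c(nlevels(fx), nlevels(fy)),
                  dimnames = list(levels(fx), levels(fy)))

print(dim(m))
print(m["103", "203"])
```

The resulting matrix has one row and one column per distinct key, and entries can be looked up by the original key values through the dimnames.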