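Converting comma-separated list values into a sparse matrix using R

Asked: 2015-09-17 19:23:03

Tags: r

In the new data structure, each segment ID should become a column name, and there should still be one row per person ID. Each segment ID cell holds the number of times that segment ID appears in the person's comma-separated list. Example below.

As is: I am trying to convert data of this form: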

| PersonID | SegmentId     |
|----------|---------------|
| 1001     | 50,61,72,42,1 |
| 1002     | 49,33,24,72   |
| 1003     | 22,22,23,99,2 |

To be: into this form:

| PersonID | 1 | 2 | 22 | 23 | 24 | 33 | 42 | 49 | 50 | 61 | 72 | 99 |
|----------|---|---|----|----|----|----|----|----|----|----|----|----|
| 1001     | 1 | 0 | 0  | 0  | 0  | 0  | 1  | 0  | 1  | 1  | 1  | 0  |
| 1002     | 0 | 0 | 0  | 0  | 1  | 1  | 0  | 1  | 0  | 0  | 1  | 0  |
| 1003     | 0 | 1 | 2  | 1  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 1  |

Here is the dput:

structure(list(V1 = structure(c(8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L
), .Label = c("2", "3", "4", "5", "6", "7", "8", "PersonID"), class = "factor"), 
    V2 = structure(c(8L, 3L, 2L, 5L, 7L, 4L, 1L, 6L), .Label = c("10038,10068,1015,103587,1042,108930,11012,11336,11445,11446,11448,11459,11485,12", 
    "10038,10093,1015,108930,11336,11450,11459,11737,11738,12", 
    "10039,10069,108930,11336,11484,11485,11737,11738,12", "10051,108930,11336,12", 
    "10055,11484,12", "1042,108930,11336,12", "108930,11336,11453,11459,12", 
    "segments"), class = "factor")), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA, 
-8L))

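2 answers:

Answer 0 (score: 1)

As suggested in the comments, the dput output seems to be messed up. I therefore assume the input data matches the example you gave. For convenience, I store the person ID as a string (data.frame() encodes it as a factor by default in R versions before 4.0):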

Input <- data.frame(
    PersonID = c("1001", "1002", "1003"), 
    segments = c("50,61,72,42,1", "49,33,24,72", "22,22,23,99,2")
)

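My first idea was to simply expand the data after splitting the comma-separated values, and then use dcast (package reshape2) to put the data into wide format: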

# Parse the data, such that each row now corresponds to a single 'segment' value
Data <- do.call(rbind, lapply(1:nrow(Input), function(i) 
    data.frame(
        PersonID = as.numeric(as.character(Input[[i, "PersonID"]])), 
        segments = as.numeric(strsplit(as.character(Input[[i, "segments"]]), ",")[[1]])
    )
))

# Convert the data to wide format, keeping the person id as the first column
library(reshape2)
Results <- as.matrix(dcast(Data, PersonID ~ segments, value.var = "segments", fun.aggregate = length, fill = 0))
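
For comparison, the same split-then-count idea can be sketched in base R alone. This is only an illustrative alternative, not the approach used above: it swaps dcast for table(), and the names parts, long and counts are made up for this example.

```r
# Same example input as above; stringsAsFactors = FALSE keeps the ids as strings
Input <- data.frame(
    PersonID = c("1001", "1002", "1003"),
    segments = c("50,61,72,42,1", "49,33,24,72", "22,22,23,99,2"),
    stringsAsFactors = FALSE
)

# Split each comma-separated string into a character vector of segment ids
parts <- strsplit(Input$segments, ",", fixed = TRUE)

# Long format: one row per (person, segment) occurrence
long <- data.frame(
    PersonID = rep(Input$PersonID, lengths(parts)),
    segments = as.numeric(unlist(parts))
)

# Cross-tabulate: rows are persons, columns are segment ids, cells are counts.
# Converting segments to numeric first makes the columns sort numerically.
counts <- table(long$PersonID, long$segments)
print(counts)
```

Note that table() returns a dense object, so this sketch shares the memory limitation of the dcast version when there are many distinct segment ids.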

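However, this does not exploit the fact that the result is a sparse matrix, which may be practically mandatory depending on the actual data you are working with. Since you mention "sparse" in the title, here is an alternative solution that is a bit longer but stores the result in a sparse matrix (via package Matrix). It relies on the input format accepted by the function sparseMatrix (see the package documentation for details and examples):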

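# Parse the data, such that each row now corresponds to a single 'segment' value
Source <- do.call(rbind, lapply(1:nrow(Input), function(i)
    data.frame(
        PersonID = as.character(Input[[i, "PersonID"]]), 
        segments = strsplit(as.character(Input[[i, "segments"]]), ",")[[1]],
        stringsAsFactors = FALSE
    )
))

# Both the person id and the segment are stored as factors (this is a key point):
# their integer codes serve as row and column indices below. The conversion is
# explicit because data.frame() no longer creates factors by default since R 4.0.
Source$PersonID <- factor(Source$PersonID)
Source$segments <- factor(Source$segments, levels = sort(unique(as.numeric(Source$segments))))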

library(Matrix)
Results_sparse <- sparseMatrix(
    i = as.numeric(Source$PersonID), 
    j = as.numeric(Source$segments), 
    x = rep(1, length.out = nrow(Source)) # will be automatically "aggregated"
)

# Use the factor levels (the original strings) to set the column names, and
# prepend a column holding the person ids
colnames(Results_sparse) <- levels(unique(Source$segments))
Results_sparse <- cbind(
    PersonID = as.numeric(levels(unique(Source$PersonID))), 
    Results_sparse
)
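
A more compact variant of the sparse construction is sketched below, under the assumption that it is acceptable to keep the person IDs as row names rather than as an extra column. It uses match() instead of factor codes to build the indices and passes the labels through the dimnames argument of sparseMatrix. The names parts, person, segment and seg_levels are illustrative only.

```r
library(Matrix)

Input <- data.frame(
    PersonID = c("1001", "1002", "1003"),
    segments = c("50,61,72,42,1", "49,33,24,72", "22,22,23,99,2"),
    stringsAsFactors = FALSE
)

# One entry per (person, segment) occurrence
parts   <- strsplit(Input$segments, ",", fixed = TRUE)
person  <- rep(Input$PersonID, lengths(parts))
segment <- as.numeric(unlist(parts))

# Distinct segment ids in numeric order; their positions become column indices
seg_levels <- sort(unique(segment))

Results_sparse <- sparseMatrix(
    i = match(person, Input$PersonID),   # row index of each occurrence
    j = match(segment, seg_levels),      # column index of each occurrence
    x = rep(1, length(person)),          # repeated (i, j) pairs are summed
    dims = c(nrow(Input), length(seg_levels)),
    dimnames = list(Input$PersonID, as.character(seg_levels))
)
print(Results_sparse)
```

Keeping the IDs in the row names means the matrix holds only counts, so a lookup such as Results_sparse["1003", "22"] works directly by label.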

Answer 1 (score: 1)

Taking Input from tguzella's answer:

Input <- data.frame(
    PersonID = c("1001", "1002", "1003"), 
    segments = c("50,61,72,42,1", "49,33,24,72", "22,22,23,99,2")
)

1: Split segments into separate variables

library(splitstackshape)
dd<-cSplit(Input, 'segments', sep=",", type.convert=FALSE)

2: Melt to make a single variable

library(reshape2)

dd2<-as.data.frame(melt(dd, id.var="PersonID"))
dd2<-na.omit(dd2[,-2])

3: Cast it as a matrix

dcast(data=dd2, PersonID ~ value, value.var="value")


Aggregation function missing: defaulting to length
  PersonID 1 2 22 23 24 33 42 49 50 61 72 99
1     1001 1 0  0  0  0  0  1  0  1  1  1  0
2     1002 0 0  0  0  1  1  0  1  0  0  1  0
3     1003 0 1  2  1  0  0  0  0  0  0  0  1
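
The output printed above is a data frame with PersonID as an ordinary column. If an actual numeric matrix is needed, one possible follow-up step is sketched here. To stay self-contained it re-reads a hand-typed copy of the printed output instead of calling dcast; the names wide and counts are illustrative only.

```r
# Hand-typed copy of the printed result, standing in for the dcast() output
wide <- read.table(header = TRUE, check.names = FALSE, text = "
PersonID 1 2 22 23 24 33 42 49 50 61 72 99
1001 1 0 0 0 0 0 1 0 1 1 1 0
1002 0 0 0 0 1 1 0 1 0 0 1 0
1003 0 1 2 1 0 0 0 0 0 0 0 1
")

# Drop the id column, convert the counts to a matrix, and keep ids as row names
counts <- as.matrix(wide[, -1])
rownames(counts) <- as.character(wide$PersonID)
print(counts)
```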