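In the new data structure, each SegmentId should become a column name. There should still be exactly one row per PersonID. Each SegmentId cell holds the number of times that segment id appears in the person's comma-separated list. An example follows.
As is: I am trying to convert data in this form: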
| PersonID | SegmentId     |
|----------|---------------|
| 1001     | 50,61,72,42,1 |
| 1002     | 49,33,24,72   |
| 1003     | 22,22,23,99,2 |
To be: into this form:
| PersonID | 1 | 2 | 22 | 23 | 24 | 33 | 42 | 49 | 50 | 61 | 72 | 99 |
|----------|---|---|----|----|----|----|----|----|----|----|----|----|
| 1001     | 1 | 0 | 0  | 0  | 0  | 0  | 1  | 0  | 1  | 1  | 1  | 0  |
| 1002     | 0 | 0 | 0  | 0  | 1  | 1  | 0  | 1  | 0  | 0  | 1  | 0  |
| 1003     | 0 | 1 | 2  | 1  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 1  |
Here is the dput output:
structure(list(V1 = structure(c(8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L
), .Label = c("2", "3", "4", "5", "6", "7", "8", "PersonID"), class = "factor"),
V2 = structure(c(8L, 3L, 2L, 5L, 7L, 4L, 1L, 6L), .Label = c("10038,10068,1015,103587,1042,108930,11012,11336,11445,11446,11448,11459,11485,12",
"10038,10093,1015,108930,11336,11450,11459,11737,11738,12",
"10039,10069,108930,11336,11484,11485,11737,11738,12", "10051,108930,11336,12",
"10055,11484,12", "1042,108930,11336,12", "108930,11336,11453,11459,12",
"segments"), class = "factor")), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-8L))
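Answer 0 (score: 1)
As suggested in the comments, the output of dput seems to be messed up. I therefore assume that the input data matches the example you gave. For convenience I enter the person id as a string (data.frame encodes it as a factor when stringsAsFactors = TRUE, which was the default before R 4.0):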
Input <- data.frame(
PersonID = c("1001", "1002", "1003"),
segments = c("50,61,72,42,1", "49,33,24,72", "22,22,23,99,2")
)
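My first idea was to simply expand the data after splitting the comma-separated values, and then use dcast (package reshape2) to put the data into wide format: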
# Parse the data so that each row corresponds to a single 'segment' value
Data <- do.call(rbind, lapply(seq_len(nrow(Input)), function(i)
  data.frame(
    PersonID = as.numeric(as.character(Input[[i, "PersonID"]])),
    segments = as.numeric(strsplit(as.character(Input[[i, "segments"]]), ",")[[1]])
  )
))
# Cast the data to wide format: one row per person id, one column per segment,
# with each cell counting the occurrences (fun.aggregate = length)
library(reshape2)
Results <- as.matrix(dcast(Data, PersonID ~ segments, value.var = "segments", fun.aggregate = length, fill = 0))
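However, this does not take advantage of the fact that the result is a sparse matrix, which may be all but required depending on the size of your actual data. Since you mention "sparse" in the title, here is an alternative solution which, although a bit longer, stores the result in a sparse matrix (via the package Matrix). It relies on the input format accepted by the function sparseMatrix, which takes row indices, column indices, and values (see the package documentation for details and examples):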
# Parse the data so that each row corresponds to a single 'segment' value
Source <- do.call(rbind, lapply(seq_len(nrow(Input)), function(i)
  data.frame(
    PersonID = as.character(Input[[i, "PersonID"]]),
    segments = strsplit(as.character(Input[[i, "segments"]]), ",")[[1]],
    stringsAsFactors = FALSE
  )
))
# Store both the person id and the segment as factors (this is the key point):
# the integer codes of the factors serve as row and column indices.
# The factors are built explicitly so the code does not depend on the R version.
Source$PersonID <- factor(Source$PersonID)
seg_values <- sort(unique(as.numeric(Source$segments)))  # numeric column order
Source$segments <- factor(as.numeric(Source$segments), levels = seg_values)
library(Matrix)
Results_sparse <- sparseMatrix(
  i = as.integer(Source$PersonID),
  j = as.integer(Source$segments),
  x = rep(1, length.out = nrow(Source)),  # duplicated (i, j) pairs are summed
  dims = c(nlevels(Source$PersonID), nlevels(Source$segments))
)
# Use the factor levels to set the column names, then prepend a column
# holding the person ids
colnames(Results_sparse) <- levels(Source$segments)
Results_sparse <- cbind(
  PersonID = as.numeric(levels(Source$PersonID)),
  Results_sparse
)
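The aggregation of repeated entries that the code above relies on can be checked on a tiny example. This is only a minimal sketch, assuming the Matrix package is installed. The index vectors are made up for illustration, and the names `i_idx`, `j_idx`, and `m` are hypothetical. By default, `sparseMatrix` adds up the `x` values of repeated `(i, j)` pairs, which is how a segment listed twice for the same person ends up with a count of 2.

```r
library(Matrix)

# Hypothetical triplets: the pair (row 1, column 2) appears twice
i_idx <- c(1, 1, 1, 2)
j_idx <- c(1, 2, 2, 3)

# One unit of weight per triplet; repeated pairs are summed by default
m <- sparseMatrix(i = i_idx, j = j_idx, x = rep(1, length(i_idx)))

# Dense view for inspection only; on real data keep the sparse form
print(as.matrix(m))
```

On real data the dense conversion in the last line should be skipped, since it gives up the memory savings that motivated the sparse representation.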
Answer 1 (score: 1)
Taking Input from tguzella's answer:
Input <- data.frame(
PersonID = c("1001", "1002", "1003"),
segments = c("50,61,72,42,1", "49,33,24,72", "22,22,23,99,2")
)
1: Split segments into separate variables
library(splitstackshape)
dd<-cSplit(Input, 'segments', sep=",", type.convert=FALSE)
2: Melt them into a single variable
library(reshape2)
dd2 <- melt(as.data.frame(dd), id.vars = "PersonID")
dd2 <- na.omit(dd2[, -2])  # drop the 'variable' column and the NA padding
3: Cast it to wide format to get the count matrix
dcast(data=dd2, PersonID ~ value, value.var="value")
Aggregation function missing: defaulting to length
  PersonID 1 2 22 23 24 33 42 49 50 61 72 99
1     1001 1 0  0  0  0  0  1  0  1  1  1  0
2     1002 0 0  0  0  1  1  0  1  0  0  1  0
3     1003 0 1  2  1  0  0  0  0  0  0  0  1
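For comparison, the same split, stack, and count steps can be sketched in base R without extra packages. This is not the method used in either answer, only an illustration under the same assumed `Input`. The names `parts`, `long`, and `counts` are hypothetical.

```r
Input <- data.frame(
  PersonID = c("1001", "1002", "1003"),
  segments = c("50,61,72,42,1", "49,33,24,72", "22,22,23,99,2"),
  stringsAsFactors = FALSE
)

# 1: split each comma-separated string into its segment ids
parts <- strsplit(Input$segments, ",", fixed = TRUE)

# 2: stack into long format, repeating each person id once per segment
long <- data.frame(
  PersonID = rep(Input$PersonID, lengths(parts)),
  segment  = as.integer(unlist(parts))
)

# 3: cross-tabulate; each cell counts how often a segment occurs for a person
counts <- table(PersonID = long$PersonID, segment = long$segment)
print(counts)
```

Because `segment` is converted to integer before tabulating, the columns are ordered numerically. The result is a dense table, so for large data the sparse approach from the first answer is likely the better fit.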