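I have a table whose rows contain unequal numbers of elements. Each element is a string with a count of 1 or 2 appended to it. I want to build a presence/absence matrix over the strings, but keep the count (1 or 2) instead of a plain 1, and put a zero where a string is not found in a row.
From this: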
  V1 V2      V3      V4        V5
1 A  cat:2   dog:1   mouse:1   horse:2
2 B  dog:2   mouse:2 dolphin:2
3 C  horse:2
4 D  cat:1   mouse:2 dolphin:2
To this:
  cat dog mouse horse dolphin
A   2   1     1     2       0
B   0   2     2     0       2
C   0   0     0     2       0
D   1   0     2     0       2
I found an earlier solution to a similar problem: Convert a dataframe to presence absence matrix
It creates a 0/1 presence/absence matrix, which does not include the counts.
Sample data:
DF <- structure(list(V1 = c("A", "B", "C", "D"),
V2 = c("cat:2", "dog:2", "horse:2", "cat:1"),
V3 = c("dog:1", "mouse:2", "", "mouse:2"),
V4 = c("mouse:1", "dolphin:2", "", "dolphin:2"),
V5 = c("horse:2", "", "", "")),
.Names = c("V1", "V2", "V3", "V4", "V5"),
class = "data.frame", row.names = c(NA, -4L))
Answer 0 (score: 2)
Some package may make this easier, but here is one solution. It won't be fast on large data, but it gets the job done:
# split each cell into its name and its count
tmp <- apply(DF[, -1], 1, strsplit, ":")
# extract the names (empty cells yield NA and are dropped)
animals <- lapply(tmp, function(x) c(na.omit(sapply(x, "[", 1))))
uniquenames <- unique(unlist(animals))
# extract the counts
reps <- lapply(tmp, function(x) as.numeric(na.omit(sapply(x, "[", 2))))
# turn the counts into named vectors
# (SIMPLIFY = FALSE keeps a list even if every row has the same number of entries)
res <- mapply(setNames, reps, animals, SIMPLIFY = FALSE)
# subset the named vectors and combine the result in a matrix
res <- do.call(rbind, lapply(res, "[", uniquenames))
# cosmetics
colnames(res) <- uniquenames
rownames(res) <- DF$V1
res[is.na(res)] <- 0
#   cat dog mouse horse dolphin
# A   2   1     1     2       0
# B   0   2     2     0       2
# C   0   0     0     2       0
# D   1   0     2     0       2
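For larger tables, the same split-then-tabulate idea can be sketched without the per-row `apply` loop. The block below is not part of the answer above; it is a minimal base R sketch that assumes every non-empty cell has the form `name:count` and that the labels in `V1` are unique. The names `cells`, `ids`, `keep`, `parts` and `long` are only illustrative. It stacks all cells into one long table and lets `xtabs` sum the counts per row label and name.

```r
# Sample data from the question
DF <- data.frame(
  V1 = c("A", "B", "C", "D"),
  V2 = c("cat:2", "dog:2", "horse:2", "cat:1"),
  V3 = c("dog:1", "mouse:2", "", "mouse:2"),
  V4 = c("mouse:1", "dolphin:2", "", "dolphin:2"),
  V5 = c("horse:2", "", "", ""),
  stringsAsFactors = FALSE
)

# Stack all cells column by column and repeat the row labels to match
cells <- unlist(DF[-1], use.names = FALSE)
ids   <- rep(DF$V1, times = ncol(DF) - 1)
keep  <- nzchar(cells)  # ignore empty cells

# Split each remaining cell into its name and its count (assumed "name:count")
parts <- strsplit(cells[keep], ":", fixed = TRUE)
long <- data.frame(
  id     = factor(ids[keep], levels = DF$V1),  # keeps rows with no entries at all
  animal = vapply(parts, `[`, character(1), 1),
  count  = as.numeric(vapply(parts, `[`, character(1), 2))
)

# Sum the counts per (row label, name); absent combinations become 0
res <- xtabs(count ~ id + animal, data = long)
res
```

The result is a contingency table with one row per label in `V1` and the names as columns, sorted alphabetically rather than by first appearance.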
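Answer 1 (score: 1)
You can melt the data into long format, use separate from tidyr to split the animal from the count, and then cast it back to wide format with the count as the value (the count has to be converted from character to numeric first).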
library(reshape2)  # melt(), dcast()
library(tidyr)     # separate()
library(dplyr)     # %>%, select()

DF %>%
  melt("V1") %>%
  separate(value, c("animal", "count"), ":", fill = "left") %>%
  transform(count = as.numeric(count)) %>%
  dcast(V1 ~ animal, value.var = "count", fun.aggregate = sum) %>%
  select(-"NA")
#   V1 cat dog dolphin horse mouse
# 1  A   2   1       0     2     1
# 2  B   0   2       2     0     2
# 3  C   0   0       0     2     0
# 4  D   1   0       2     0     2
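The same long-then-wide reshaping can also be sketched with tidyr's own `pivot_longer` and `pivot_wider` in place of the reshape2 functions. This variant is not from the answer above; it assumes tidyr 1.0.0 or later, and the names `column` and `wide` are only illustrative. Empty cells are filtered out before splitting, so no `NA` column has to be removed afterwards.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
DF <- data.frame(
  V1 = c("A", "B", "C", "D"),
  V2 = c("cat:2", "dog:2", "horse:2", "cat:1"),
  V3 = c("dog:1", "mouse:2", "", "mouse:2"),
  V4 = c("mouse:1", "dolphin:2", "", "dolphin:2"),
  V5 = c("horse:2", "", "", ""),
  stringsAsFactors = FALSE
)

wide <- DF %>%
  # one row per (V1, cell)
  pivot_longer(-V1, names_to = "column", values_to = "value") %>%
  # drop the empty cells
  filter(value != "") %>%
  # split "name:count" into two columns
  separate(value, c("animal", "count"), sep = ":") %>%
  mutate(count = as.numeric(count)) %>%
  # one column per animal; absent animals get 0, repeated ones are summed
  pivot_wider(id_cols = V1, names_from = animal, values_from = count,
              values_fill = 0, values_fn = sum)

wide
```

One caveat of this sketch: a row whose cells are all empty would be removed by the `filter` step and would not appear in the result.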