Question

I have a dataset with this structure:

  region1 region2 region3
1      10       5       5
2       8      10       8
3      13      15      12
4       3      17      11
5      17               9
6      12              15
7       4              
8      18              
9       1

I need:

   item region1 region2 region3
1     1       1       0       0
2     3       1       0       0
3     4       1       0       0
4     5       0       1       1
5     8       1       0       1
6     9       0       0       1
7    10       1       1       0
8    11       0       0       1
9    12       1       0       1
10   13       1       0       0
11   15       0       1       1
12   17       1       1       0
13   18       1       0       0

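The plan was to get a distinct list of items, join each region as its own column, and replace matches with 1 and missing values with 0; but I must be missing a critical point about R's merge, because it drops the main column of interest. Any advice is greatly appreciated! I would prefer an R solution, but my next step is to research the sqldf package.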

#read in data
regions <- read.csv("c:/data/regions.csv")

#get unique list of items from all regions
items <- na.omit(unique(stack(regions)[1]))

#merge distinct items with each region, replace matches with 1, missings with 0
merge.test <- merge(items,regions,by.x="values", by.y=c("region1"), all=TRUE)

Answer 1

It helps to provide a reproducible example (i.e., give us a simple copy-paste command that builds your example data).

You didn't say, so I'm guessing your data might be in a list?

dat <- list(region1=c(10, 8, 3, 17, 12, 4, 18, 1),
            region2=c(5,10,15,17),
            region3=c(5,8,12,11,9,15))

First find the set of all items (sorting is probably not necessary; I only did it because yours are sorted):

ids <- sort(unique(unlist(dat)))

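Then, for each region, just check whether each ID in the unique list is in that region, coercing the logical TRUE/FALSE to 0 and 1 (you can leave it as TRUE/FALSE if that works for you):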

data.frame(ids,
    region1=as.integer(ids %in% dat$region1),
    region2=as.integer(ids %in% dat$region2),
    region3=as.integer(ids %in% dat$region3))

That is fine if you only have 3 regions; if you have more, you may want to automate the typing:

cols <- lapply(dat, function (region) as.integer(ids %in% region))
cols$id <- ids
df <- do.call(data.frame, cols)

where do.call calls the data.frame function with the list cols as its (named) arguments, i.e. it is just

data.frame(id=..., region1=..., region2=..., region3=...)
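To make the do.call point concrete, here is a small sketch with made-up arguments (the names `args`, `id` and `flag` are hypothetical and only serve the illustration). It builds the same data frame twice, once through do.call and once by writing the arguments out by hand:

```r
# Hypothetical toy arguments, only to show how do.call() spreads a named list
args <- list(id = 1:3, flag = c(1L, 0L, 1L))

# do.call() passes each list element as a named argument to data.frame()
via_do_call <- do.call(data.frame, args)

# The same call with the arguments typed out explicitly
by_hand <- data.frame(id = 1:3, flag = c(1L, 0L, 1L))

# Compare the two results
all.equal(via_do_call, by_hand)
```

The benefit is that the list can have any number of elements, so the code does not change when regions are added.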

If your original dat came from a CSV and every column has NA values in it, you may need to add na.omit where appropriate.
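A rough sketch of the CSV case, assuming the file was read into a data.frame whose shorter columns are padded with NA. The `regions` object below is made up to mimic that, and `x[!is.na(x)]` is used in place of na.omit because it has the same effect here without attaching extra attributes:

```r
# Hypothetical stand-in for read.csv() output: shorter columns padded with NA
regions <- data.frame(region1 = c(10, 8, 3, 17, 12, 4, 18, 1),
                      region2 = c(5, 10, 15, 17, NA, NA, NA, NA),
                      region3 = c(5, 8, 12, 11, 9, 15, NA, NA))

# Drop the NAs column by column, giving a list of vectors of differing length
dat <- lapply(regions, function(x) x[!is.na(x)])

# Same approach as above: unique items, then one 0/1 column per region
ids <- sort(unique(unlist(dat)))
cols <- lapply(dat, function(region) as.integer(ids %in% region))
result <- data.frame(item = ids, cols)
```

Dropping the NAs first keeps NA from showing up as an item of its own in `ids`.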

Answer 2

The existing answer is good, but it seems complicated. Just try stack + table:

table(stack(dat))
#       ind
# values region1 region2 region3
#     1        1       0       0
#     3        1       0       0
#     4        1       0       0
#     5        0       1       1
#     8        1       0       1
#     9        0       0       1
#     10       1       1       0
#     11       0       0       1
#     12       1       0       1
#     15       0       1       1
#     17       1       1       0
#     18       1       0       0

I would also go out on a limb and say that, considering your current approach, you actually have a data.frame and not a list:

DAT <- dat
Len <- max(sapply(DAT, length))
DAT <- data.frame(lapply(DAT, function(x) { length(x) <- Len; x }))

In that case, the solution is no different:

table(stack(DAT))
#       ind
# values region1 region2 region3
#     1        1       0       0
#     3        1       0       0
#     4        1       0       0
#     5        0       1       1
#     8        1       0       1
#     9        0       0       1
#     10       1       1       0
#     11       0       0       1
#     12       1       0       1
#     15       0       1       1
#     17       1       1       0
#     18       1       0       0
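If a data.frame with an explicit item column is wanted rather than a table object, one possible follow-up is sketched below. It assumes the same list `dat` as in the first answer; the names `tab`, `wide` and `out` are only illustrative:

```r
dat <- list(region1 = c(10, 8, 3, 17, 12, 4, 18, 1),
            region2 = c(5, 10, 15, 17),
            region3 = c(5, 8, 12, 11, 9, 15))

tab <- table(stack(dat))

# as.data.frame.matrix() keeps the wide layout;
# plain as.data.frame() on a table would return long format instead
wide <- as.data.frame.matrix(tab)

# Move the row names (item values stored as text) into a numeric column
out <- cbind(item = as.numeric(rownames(wide)), wide)
rownames(out) <- NULL
```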

Answer 3

Using @mathematical.coffee's example and qdap:

dat <- list(region1=c(10, 8, 3, 17, 12, 4, 18, 1),
            region2=c(5,10,15,17),
            region3=c(5,8,12,11,9,15))

library(qdap)
matrix2df(t(mtabulate(dat)), "item")

You may need to extend this with the following:

FUN <- function(x) as.numeric(x > 0)
matrix2df(apply(t(mtabulate(dat)), 2, FUN), "item")

That applies if an item appears more than once within a vector.
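For comparison, the same capping of counts can be sketched in base R, without qdap, using the stack + table approach from the other answer. The `dup` data below is hypothetical and only exists to create a repeated item:

```r
# Hypothetical data: item 5 appears twice in region2
dup <- list(region1 = c(1, 5),
            region2 = c(5, 5, 7))

# Raw counts: a repeated item gets a count above 1
counts <- table(stack(dup))

# Turn any positive count into 1, keeping zeros as 0
binary <- (counts > 0) + 0L
```

Here `counts` holds how often each item occurs per region, while `binary` only records presence or absence.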

Data restructuring in R (short lists to binary)
