Convert a comma-separated column into columns with Boolean values

Time: 2015-06-08 16:42:28

Tags: r csv dataframe

I have the following comma-separated data in a column named services of a data.frame.

> dput(structure(df$services[1:5]))
list("Global Expense Management, Company Privacy Policy", "Removal Services, Global Expense Management", 
    "Removal Services, Exception & Cost Admin, Global Cost Estimate, Company Privacy Policy", 
    "Removal Services, Exception & Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy", 
    "Global Expense Management, Company Privacy Policy")

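I want to convert this data into separate columns in the data frame, one per service. If a row contains a service, the value under that service's column should be TRUE; otherwise it should be FALSE.

For example, I would like my data frame to look like this: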

GlobalExpenseManagement    |    CompanyPrivacyPolicy   |   etc...
TRUE                            TRUE
TRUE                            FALSE
FALSE                           TRUE

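I assume I have to split the comma-separated values, deduplicate them, and then add them to my data frame as names(df). However, I don't know how to iterate over the data set and set TRUE / FALSE depending on whether a row contains the service.

Does anyone have a good idea of how to do this?

Edit: merging the data

I am now trying to merge the new matrix with my existing data frame, so that the services column is replaced by the new per-service columns. Based on @plafort's great answer, I tried this: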

names(df) <- headnames
rbind(mat, df)

However, I get this error:

Error in names(df) <- headnames : 'names' attribute [178] must be the same length as the vector [7]

I also tried this:

final <- data.frame(cbind(mat, df))

However, the columns of df appear to be missing. How can I merge the columns of mat and df?

2 Answers:

Answer 0: (score: 3)

Try:

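# lst is the list shown in the question, e.g. lst <- df$services
# Split each string on ", " into its individual services
splitup <- sapply(unlist(lst), strsplit, ', ')
# All distinct service names, in order of first appearance
headnames <- unique(unlist(splitup))
# One row per input element, one logical column per service
(mat <- t(unname(sapply(splitup, function(x) headnames %in% x))))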

      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
[1,]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[2,]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
[3,] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[4,]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[5,]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

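We first split the data on the commas, using unlist to access the elements directly. headnames then collects the unique category titles, as you described. The last line checks, for each list item, which of the header categories it contains, removes the automatic names with unname, and transposes the result with t so that each row corresponds to one input element.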
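The steps described above can also be written as a self-contained sketch. It is a variation rather than the answer's exact code: it calls strsplit directly on the character vector and uses vapply in place of sapply with unname. The three-row lst is a shortened stand-in for the question's data.

```r
# Shortened stand-in for the question's data (an assumption for illustration).
lst <- list(
  "Global Expense Management, Company Privacy Policy",
  "Removal Services, Global Expense Management",
  "Removal Services, Company Privacy Policy"
)

# Split each string on ", "; fixed = TRUE treats the separator literally.
splitup <- strsplit(unlist(lst), ", ", fixed = TRUE)

# Unique service names, in order of first appearance.
headnames <- unique(unlist(splitup))

# vapply returns one column per input element, so transpose
# to get one row per input element and one column per service.
mat <- t(vapply(splitup, function(x) headnames %in% x,
                logical(length(headnames))))
print(mat)
```

vapply is used here only because it fixes the result type and length up front; for this input the resulting matrix should match what the sapply version produces.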

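To add names at the top, we use the colnames function to assign the unique names defined earlier as the column headers. The order is correct because the same headnames vector was used to build each row of the matrix.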

colnames(mat) <- headnames

Global Expense Management Company Privacy Policy
[1,]                      TRUE                   TRUE
[2,]                      TRUE                  FALSE
[3,]                     FALSE                   TRUE
[4,]                      TRUE                   TRUE
[5,]                      TRUE                   TRUE...
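For the follow-up in the question about combining mat with the existing data frame: names(df) <- headnames tries to rename the columns of df itself, which fails when the number of names differs from the number of columns, and rbind stacks rows rather than adding columns. A minimal sketch of binding the columns side by side follows. It assumes mat has one row per row of df; the id column and the three-row data are hypothetical stand-ins for the real df.

```r
# Hypothetical data frame: 'id' is an invented column standing in for
# whatever other columns the real df has.
df <- data.frame(id = 1:3)
df$services <- list(
  "Global Expense Management, Company Privacy Policy",
  "Removal Services, Global Expense Management",
  "Removal Services, Company Privacy Policy"
)

# Build the logical matrix as in the answer above.
splitup <- strsplit(unlist(df$services), ", ", fixed = TRUE)
headnames <- unique(unlist(splitup))
mat <- t(vapply(splitup, function(x) headnames %in% x,
                logical(length(headnames))))
colnames(mat) <- headnames

# Drop the original services column and append the logical columns.
# check.names = FALSE keeps the spaces in the service names.
final <- data.frame(df[setdiff(names(df), "services")], mat,
                    check.names = FALSE)
str(final)
```

Without check.names = FALSE, data.frame would rewrite names such as "Removal Services" to the syntactically valid "Removal.Services".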

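Answer 1: (score: 2)

I would consider cSplit_e from my "splitstackshape" package. The result is binary "1" and "0" rather than TRUE and FALSE, but that should be easy to convert.

Sample data: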

df <- data.frame(services = I(
  list("Global Expense Management, Company Privacy Policy", "Removal Services, Global Expense Management", 
       "Removal Services, Exception &amp; Cost Admin, Global Cost Estimate, Company Privacy Policy", 
       "Removal Services, Exception &amp; Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy", 
       "Global Expense Management, Company Privacy Policy")))

Convert the "services" column to a vector instead of a list:

df$services <- unlist(df$services)

Now split it:

library(splitstackshape)
cSplit_e(df, "services", ",", type = "character", fill = 0)
##                                                                                                                                                  services
## 1                                                                                                       Global Expense Management, Company Privacy Policy
## 2                                                                                                             Removal Services, Global Expense Management
## 3                                                              Removal Services, Exception &amp; Cost Admin, Global Cost Estimate, Company Privacy Policy
## 4 Removal Services, Exception &amp; Cost Admin, Ancillary Services, Global Cost Estimate, Global Expense Management, Perm Storage, Company Privacy Policy
## 5                                                                                                       Global Expense Management, Company Privacy Policy
##   services_Ancillary Services services_Company Privacy Policy services_Exception &amp; Cost Admin
## 1                           0                               1                                   0
## 2                           0                               0                                   0
## 3                           0                               1                                   1
## 4                           1                               1                                   1
## 5                           0                               1                                   0
##   services_Global Cost Estimate services_Global Expense Management services_Perm Storage
## 1                             0                                  1                     0
## 2                             0                                  1                     0
## 3                             1                                  0                     0
## 4                             1                                  1                     1
## 5                             0                                  1                     0
##   services_Removal Services
## 1                         0
## 2                         1
## 3                         1
## 4                         1
## 5                         0
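The conversion from 1/0 to TRUE/FALSE mentioned at the start of this answer could look like the sketch below. To stay self-contained it does not call cSplit_e: out is a small hand-built stand-in with the same shape as the output above (the original column plus "services_"-prefixed indicator columns), and its values are illustrative only.

```r
# Hand-built stand-in for the cSplit_e() result (an assumption for
# illustration; the real output has more rows and columns).
out <- data.frame(
  services = c("Removal Services, Perm Storage", "Perm Storage"),
  `services_Perm Storage` = c(1, 1),
  `services_Removal Services` = c(1, 0),
  check.names = FALSE
)

# The indicator columns share the "services_" prefix.
ind <- grepl("^services_", names(out))

# Comparing with 1 yields TRUE/FALSE whether a column is numeric or character.
out[ind] <- lapply(out[ind], function(x) x == 1)
str(out)
```

Selecting columns by prefix leaves the original services column untouched, so the same lines should apply to the real result as long as no other column name starts with "services_".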