R DataFrame - One-hot encoding of a column containing multiple terms

Time: 2016-09-29 19:21:47

Tags: r dataframe one-hot-encoding

I have a data frame in which one column holds multiple comma-separated values:

mydf <- structure(list(Age = c(99L, 10L, 40L, 15L),
                       Info = c("good, bad, sad", "nice, happy, joy", "NULL", "okay, nice, fun, wild, go"),
                       Target = c("Boy", "Girl", "Boy", "Boy")), 
                  .Names = c("Age", "Info", "Target"),
                  row.names = c(NA, 4L),
                  class = "data.frame")

> mydf
  Age                      Info Target
1  99            good, bad, sad    Boy
2  10          nice, happy, joy   Girl
3  40                      NULL    Boy
4  15 okay, nice, fun, wild, go    Boy

I want to split the Info column into one-hot-encoded columns and append the results after the Target column, for example:

  Age                      Info Target good bad sad nice ... NULL ..
1  99            good, bad, sad    Boy    1   1   1   0        0
2  10          nice, happy, joy   Girl    0   0   0   1        0
3  40                      NULL    Boy    0   0   0   0        1
4  15 okay, nice, fun, wild, go    Boy    0   0   0   1        0

In Python, I can do something like the following to build a dictionary and then use it to assign the columns.

In [1]: import itertools

In [2]: values = ["good, bad, sad", "nice, happy, joy", "NULL",  "okay, nice, fun, wild, go"]

In [3]: terms = list(itertools.chain(*[v.split(", ") for v in values]))

In [4]: dictionary = {v:k for k,v in enumerate(terms)}

In [6]: dictionary
Out[6]: 
{'NULL': 6, 'bad': 1,
 'fun': 9, 'go': 11, 'good': 0, 'happy': 4,
 'joy': 5, 'nice': 8, 'okay': 7, 'sad': 2, 'wild': 10}

So far, this is what I can do in R:

> lapply(mydf["Info"], function(x) { strsplit(x, ", ") } )
$Info
$Info[[1]]
[1] "good" "bad"  "sad" 

$Info[[2]]
[1] "nice"  "happy" "joy"  

$Info[[3]]
[1] "NULL"

$Info[[4]]
[1] "okay" "nice" "fun"  "wild" "go"  

I don't understand how to convert this into a dictionary in R and use it to build the one-hot-encoded columns.

How can I solve this?

1 Answer:

Answer 0: (score: 8)

One option is mtabulate from the qdapTools package, applied after splitting the 'Info' column:

library(qdapTools)
cbind(mydf, mtabulate(strsplit(mydf$Info, ", ")))
#  Age                      Info Target bad fun go good happy joy nice NULL okay sad wild
#1  99            good, bad, sad    Boy   1   0  0    1     0   0    0    0    0   1    0
#2  10          nice, happy, joy   Girl   0   0  0    0     1   1    1    0    0   0    0
#3  40                      NULL    Boy   0   0  0    0     0   0    0    1    0   0    0
#4  15 okay, nice, fun, wild, go    Boy   0   1  1    0     0   0    1    0    1   0    1
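
If installing qdapTools is not an option, the same result can probably be reached in base R. The following is only a sketch, not part of the original answer. The names terms_by_row, vocab, term_index, onehot and result are made up for illustration. It treats the literal string "NULL" as an ordinary term, as the question does, and it orders the new columns by first appearance rather than alphabetically as in the mtabulate output.

```r
# Rebuild the example data frame from the question.
mydf <- data.frame(
  Age = c(99L, 10L, 40L, 15L),
  Info = c("good, bad, sad", "nice, happy, joy", "NULL",
           "okay, nice, fun, wild, go"),
  Target = c("Boy", "Girl", "Boy", "Boy"),
  stringsAsFactors = FALSE
)

# Split each Info string into its terms: a list with one character vector per row.
terms_by_row <- strsplit(mydf$Info, ", ", fixed = TRUE)

# Distinct terms in order of first appearance.
vocab <- unique(unlist(terms_by_row))

# A named integer vector is a rough analogue of the Python dictionary
# (term -> position); it is shown for comparison and not needed below.
term_index <- setNames(seq_along(vocab), vocab)

# For each row, mark which vocabulary terms are present (1) or absent (0).
# vapply returns a terms-by-rows matrix, so transpose it to rows-by-terms.
onehot <- t(vapply(terms_by_row,
                   function(terms) as.integer(vocab %in% terms),
                   integer(length(vocab))))
colnames(onehot) <- vocab

# Append the indicator columns after Target.
result <- cbind(mydf, onehot)
print(result)
```

Because cbind on a data frame does not rewrite column names, the column called NULL keeps its name; it is best accessed as result[["NULL"]].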