Encoding data for a neural network

Time: 2016-01-23 23:57:28

Tags: r neural-network

I'm having trouble preparing my data for use in a neural network. My table looks like this:

drug.name    molecular.target         molecular.weight

drug1        target1                  225
drug2        target2,target3          210
drug3        target4,target1          120
drug4        target1,target2,target3  110
                     (...)

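As I found out earlier, the data has to be converted to dummy variables before I can use it. I don't know how to handle the multiple values in the target column so that I end up with a matrix like this: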

drug.name molecular.weight  target1  target2 target3(...)

drug1     225               1        0       0
drug2     210               0        1       1
                          (...)

The data set is quite large, so I can't create and fill in the new columns by hand.

I hope that makes sense ;) Sebastian

1 Answer:

Answer 0: (score: 0)

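This is a "hacky" solution, but it seems to work on the small set I tested. It produces a lot of warnings, and I don't know how to make them go away. If anyone can suggest a simplification, I'll update the answer.

Note: see the solution under Edit below, which addresses the changes made to the original question. The initial solution immediately below addresses the original question.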

library(dplyr)
library(tidyr)
library(stringr) # provides str_split(), used to build targetset below

### Using this input data set
drug_df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
drug.name    molecular.target         molecular.weight
drug1        target1                  225
drug2        target2,target3          210
drug3        target4,target1          120
drug4        target1,target2,target3  110')
drug_df

##  drug.name        molecular.target molecular.weight
##1     drug1                 target1              225
##2     drug2         target2,target3              210
##3     drug3         target4,target1              120
##4     drug4 target1,target2,target3              110

### Process the input data frame
targetset <- sort(unique(unlist(sapply(drug_df$molecular.target, function(x) str_split(x, ',')))))

drug_df_new <-
    drug_df %>%
    separate(molecular.target, targetset, ',')      %>% # Create new target columns
    gather(key, val, -drug.name, -molecular.weight) %>% # Put in deep format for further manipulation
    select(-key)                                    %>% # Remove key column, it isn't needed.
    filter(!is.na(val))                             %>% # Only want drug targets, not empty targets
    rename(key = val)                               %>% # "val" will be used as new "key"
    group_by(drug.name)                             %>% # Group to get target count
    mutate(val = n())                               %>% # set val to target count
    spread(key, val, fill = 0)                          # Put in final format

drug_df_new
##  drug.name molecular.weight target1 target2 target3 target4
##      (chr)            (int)   (dbl)   (dbl)   (dbl)   (dbl)
##1     drug1              225       1       0       0       0
##2     drug2              210       0       2       2       0
##3     drug3              120       2       0       0       2
##4     drug4              110       3       3       3       0
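
Regarding the warnings: as far as I can tell they come from separate(), which complains whenever a row has fewer comma-separated pieces than there are names in targetset. A minimal sketch of one way to silence them, assuming the installed tidyr version supports the fill argument of separate() (the name sep_df is only for illustration):

```r
library(tidyr)

# Same toy input as above
drug_df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
drug.name    molecular.target         molecular.weight
drug1        target1                  225
drug2        target2,target3          210
drug3        target4,target1          120
drug4        target1,target2,target3  110')

# Distinct targets, built here with base strsplit() instead of str_split()
targetset <- sort(unique(unlist(strsplit(drug_df$molecular.target, ","))))

# fill = "right" pads rows that have too few pieces with NA on the right,
# which is what separate() does anyway, just without warning about it
sep_df <- separate(drug_df, molecular.target, into = targetset,
                   sep = ",", fill = "right")
sep_df
```

Note that after this step the column names are only placeholders: the column called target1 holds whichever target was listed first for each drug. That is why the key column is dropped again in the pipeline.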

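Edit

The alternative solution below addresses the changes made to the original post. It produces the result specified in the edited version of the post. Once the solution below is confirmed to work on larger data sets, the solution above will be removed.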

drug_df_new <-
     drug_df %>%
     separate(molecular.target, targetset, ',')      %>% # Create new target columns
     gather(key, val, -drug.name, -molecular.weight) %>% # Put in deep format for further manipulation
     mutate(new_val = ifelse(!is.na(val), 1, 0))     %>% # Create the new value 0 or 1
     select(-key)                                    %>% # Remove key column, it isn't needed.
     filter(!is.na(val))                             %>% # Remove lines where no target exists
     spread(val, new_val, fill = 0)                      # Put in wide format, one column per target.

##  drug.name molecular.weight target1 target2 target3 target4
##1     drug1              225       1       0       0       0
##2     drug2              210       0       1       1       0
##3     drug3              120       1       0       0       1
##4     drug4              110       1       1       1       0
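
For comparison, the same 0/1 table can probably be built without dplyr or tidyr at all. The following is only a sketch in base R, not a tested replacement for the pipeline above; the names target_list, indicators and drug_dummy are made up for this example, and it assumes the targets are separated by plain commas with no surrounding spaces:

```r
# Same toy input as above
drug_df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
drug.name    molecular.target         molecular.weight
drug1        target1                  225
drug2        target2,target3          210
drug3        target4,target1          120
drug4        target1,target2,target3  110')

# Split each comma-separated string into a character vector of targets
target_list <- strsplit(drug_df$molecular.target, ",", fixed = TRUE)

# All distinct targets, sorted; these become the indicator columns
targetset <- sort(unique(unlist(target_list)))

# One row per drug, one column per target: 1 if the drug lists it, else 0
indicators <- do.call(rbind, lapply(target_list, function(tg) {
  as.integer(targetset %in% tg)
}))
colnames(indicators) <- targetset

# Attach the indicator columns to the remaining columns
drug_dummy <- cbind(drug_df[, c("drug.name", "molecular.weight")], indicators)
drug_dummy
```

Because each drug is processed independently and nothing is padded with NA, this version should not raise the warnings mentioned earlier.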