Question

I am having trouble preparing my data for use in a neural network. My table looks like this:

drug.name    molecular.target         molecular.weight

drug1        target1                  225
drug2        target2,target3          210
drug3        target4,target1          120
drug4        target1,target2,target3  110
                     (...)

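As I found out earlier, the data has to be converted to dummy variables before I can use it. I don't know how to handle the multiple values in the target column so that I end up with a matrix like this: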

drug.name molecular.weight  target1  target2 target3(...)

drug1     225               1        0       0
drug2     210               0        1       1
                          (...)

The data set is quite large, so I can't create and fill in the new columns manually.

I hope you understand me ;) Sebastian

Answer 1

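This is a "hacky" solution, but it seems to work on the small set I tested. It produces a lot of warnings, and I don't know how to make them go away. If anyone can suggest a simplification, I will update the answer.

Note: see the solution under Edit below, which addresses the changes made to the original question. The initial solution immediately below answers the question as originally asked.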

library(dplyr)
library(tidyr)

### Using this input data set
drug_df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
drug.name    molecular.target         molecular.weight
drug1        target1                  225
drug2        target2,target3          210
drug3        target4,target1          120
drug4        target1,target2,target3  110')
drug_df

##  drug.name        molecular.target molecular.weight
##1     drug1                 target1              225
##2     drug2         target2,target3              210
##3     drug3         target4,target1              120
##4     drug4 target1,target2,target3              110

### Process the input data frame
# Base strsplit() is used here because stringr (which provides str_split) is never loaded
targetset <- sort(unique(unlist(strsplit(drug_df$molecular.target, ','))))

drug_df_new <-
    drug_df %>%
    separate(molecular.target, targetset, ',')      %>% # Create new target columns
    gather(key, val, -drug.name, -molecular.weight) %>% # Put in deep format for further manipulation
    select(-key)                                    %>% # Remove key column, it isn't needed.
    filter(!is.na(val))                             %>% # Only want drug targets, not empty targets
    rename(key = val)                               %>% # "val" will be used as new "key"
    group_by(drug.name)                             %>% # Group to get target count
    mutate(val = n())                               %>% # set val to target count
    spread(key, val, fill = 0)                          # Put in final format

drug_df_new
##  drug.name molecular.weight target1 target2 target3 target4
##      (chr)            (int)   (dbl)   (dbl)   (dbl)   (dbl)
##1     drug1              225       1       0       0       0
##2     drug2              210       0       2       2       0
##3     drug3              120       2       0       0       2
##4     drug4              110       3       3       3       0

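Edit

The alternative solution below addresses the changes made to the original post and produces the result specified in the edited version of the question. Once it is confirmed to work on larger data sets, the solution above will be removed.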

drug_df_new <-
    drug_df %>%
    separate(molecular.target, targetset, ',')      %>% # Create new target columns
    gather(key, val, -drug.name, -molecular.weight) %>% # Put in deep format for further manipulation
    mutate(new_val = ifelse(!is.na(val), 1, 0))     %>% # Create the new value 0 or 1
    select(-key)                                    %>% # Remove key column, it isn't needed.
    filter(!is.na(val))                             %>% # Remove lines where no target exists
    spread(val, new_val, fill = 0)                      # Put in longer format.

##  drug.name molecular.weight target1 target2 target3 target4
##1     drug1              225       1       0       0       0
##2     drug2              210       0       1       1       0
##3     drug3              120       1       0       0       1
##4     drug4              110       1       1       1       0
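
For comparison, the same 0/1 encoding can probably be built in base R without the separate() step, which may be where the warnings come from. The following is only a minimal sketch, not the answer's method: it assumes the targets are separated by plain commas with no surrounding spaces, and the names target_list, indicators and drug_dummies are made up for illustration.

```r
# Same input as above, built explicitly so the sketch is self-contained
drug_df <- data.frame(
  drug.name        = c("drug1", "drug2", "drug3", "drug4"),
  molecular.target = c("target1", "target2,target3",
                       "target4,target1", "target1,target2,target3"),
  molecular.weight = c(225, 210, 120, 110),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into a character vector of targets
# (assumption: no whitespace around the commas)
target_list <- strsplit(drug_df$molecular.target, ",", fixed = TRUE)

# All distinct targets, sorted, become the indicator columns
targetset <- sort(unique(unlist(target_list)))

# One row per drug, one column per target: 1 if the drug lists it, else 0
indicators <- do.call(
  rbind,
  lapply(target_list, function(targets) as.integer(targetset %in% targets))
)
colnames(indicators) <- targetset

# Attach the indicator columns to the remaining drug columns
drug_dummies <- cbind(drug_df[c("drug.name", "molecular.weight")], indicators)
drug_dummies
```

Because every drug is handled by the same %in% lookup against targetset, the number of targets per drug does not matter, and no NA padding is created along the way.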

Encoding data for neural networks
