I'm having trouble preparing my data for use in a neural network. My table looks like this:
drug.name molecular.target molecular.weight
drug1 target1 225
drug2 target2,target3 210
drug3 target4,target1 120
drug4 target1,target2,target3 110
(...)
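From what I've found so far, the data has to be converted into dummy variables before I can use it. I don't know how to handle the multiple values in the target column so that I end up with a matrix like this: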
drug.name molecular.weight target1 target2 target3(...)
drug1 225 1 0 0
drug2 210 0 1 1
(...)
The data set is quite large, so I can't create and fill in the new columns by hand.
I hope you understand me ;) Sebastian
Answer 0 (score: 0)
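This is a "hacky" solution, but it seems to work on the small set I tested. It produces a lot of warnings, and I don't know how to make them go away. If anyone can suggest a simplification, I'll update the answer.
Note: see the solution in the edit below, which addresses the change made to the original question. The initial solution immediately below addresses the question as originally asked.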
library(dplyr)
library(tidyr)
library(stringr) # provides str_split(), used when building targetset
### Using this input data set
drug_df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
drug.name molecular.target molecular.weight
drug1 target1 225
drug2 target2,target3 210
drug3 target4,target1 120
drug4 target1,target2,target3 110')
drug_df
## drug.name molecular.target molecular.weight
##1 drug1 target1 225
##2 drug2 target2,target3 210
##3 drug3 target4,target1 120
##4 drug4 target1,target2,target3 110
### Process the input data frame
targetset <- sort(unique(unlist(sapply(drug_df$molecular.target, function(x) str_split(x, ',')))))
drug_df_new <-
  drug_df %>%
  separate(molecular.target, targetset, ',') %>%       # Create new target columns
  gather(key, val, -drug.name, -molecular.weight) %>%  # Put in long format for further manipulation
  select(-key) %>%                                     # Remove key column, it isn't needed
  filter(!is.na(val)) %>%                              # Only want drug targets, not empty targets
  rename(key = val) %>%                                # "val" will be used as new "key"
  group_by(drug.name) %>%                              # Group to get target count
  mutate(val = n()) %>%                                # Set val to target count
  spread(key, val, fill = 0)                           # Put in final (wide) format
drug_df_new
## drug.name molecular.weight target1 target2 target3 target4
## (chr) (int) (dbl) (dbl) (dbl) (dbl)
##1 drug1 225 1 0 0 0
##2 drug2 210 0 2 2 0
##3 drug3 120 2 0 0 2
##4 drug4 110 3 3 3 0
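On the warnings: my guess is that most of them come from separate(), which complains whenever a row has fewer pieces than there are new columns. The following is a minimal sketch of one way to silence them, assuming a tidyr version whose separate() accepts the fill argument. Base strsplit() is swapped in for str_split() so the snippet runs without stringr, and the name `separated` is only for illustration.

```r
library(dplyr)
library(tidyr)

drug_df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
drug.name molecular.target molecular.weight
drug1 target1 225
drug2 target2,target3 210
drug3 target4,target1 120
drug4 target1,target2,target3 110')

# All distinct targets, sorted; base strsplit() stands in for str_split()
targetset <- sort(unique(unlist(strsplit(drug_df$molecular.target, ",", fixed = TRUE))))

# fill = "right" pads rows that have too few pieces with NA on the right,
# without emitting the "missing pieces" warning
separated <- drug_df %>%
  separate(molecular.target, targetset, sep = ",", fill = "right")
```

The rest of the pipeline should be able to continue from `separated` unchanged, since the NA padding is the same as before.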
Edit
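The alternative solution below addresses the change made to the original post. It produces the result specified in the edited version of the post. Once it is confirmed that the following solution works on larger data sets, the solution above will be removed.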
drug_df_new <-
  drug_df %>%
  separate(molecular.target, targetset, ',') %>%       # Create new target columns
  gather(key, val, -drug.name, -molecular.weight) %>%  # Put in long format for further manipulation
  mutate(new_val = ifelse(!is.na(val), 1, 0)) %>%      # Create the new value 0 or 1
  select(-key) %>%                                     # Remove key column, it isn't needed
  filter(!is.na(val)) %>%                              # Remove lines where no target exists
  spread(val, new_val, fill = 0)                       # Put in wide format, one column per target
## drug.name molecular.weight target1 target2 target3 target4
##1 drug1 225 1 0 0 0
##2 drug2 210 0 1 1 0
##3 drug3 120 1 0 0 1
##4 drug4 110 1 1 1 0
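A possible simplification, offered as a sketch rather than a tested replacement for larger data: assuming a tidyr version that provides separate_rows(), each target can first be put on its own row, which avoids both the targetset helper and the NA padding. The column name `present` and the result name `drug_dummies` are made up for illustration.

```r
library(dplyr)
library(tidyr)

drug_df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
drug.name molecular.target molecular.weight
drug1 target1 225
drug2 target2,target3 210
drug3 target4,target1 120
drug4 target1,target2,target3 110')

drug_dummies <- drug_df %>%
  separate_rows(molecular.target, sep = ",") %>%  # one row per drug/target pair
  mutate(present = 1) %>%                          # marker: this drug has this target
  spread(molecular.target, present, fill = 0)      # one 0/1 column per target
```

This should give the same 0/1 table as the output above. On newer tidyr versions, pivot_wider() is the suggested successor to spread().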