我需要编写一个R代码,它将执行以下操作:
操作前的数据(列)如下所示:
jobTitle
1 <NA>
2 <NA>
3 <NA>
4 Functional Architect, Business Technology
5 <NA>
6 <NA>
7 <NA>
8 <NA>
9 <NA>
10 Founder and President
11 Product Manager
12 <NA>
13 <NA>
14 <NA>
15 Head of Customer Experience & Online Sales
16 <NA>
17 <NA>
18 Founder and President
19 <NA>
20 <NA>
21 Product Manager
22 <NA>
23 Customer Value Manager
24 <NA>
25 Lead Software Developer
...
我需要的输出是:
Founder and President Product Manager
0 1
1 0
0 1
1 0
我得到的输出是:
Founder and President Product Manager Founder and President Product Manager
0 1 0 0
1 0 0 0
0 0 1 0
0 0 0 1
我的代码是:
library(plyr)
library(stringr)
library(gdata)
library(readxl)
train <- read_excel("data.xlsx")
#looping through the jobTitle column
for(i in 1:sum(nrow(train[4]))){
if ((!is.na(train[i,4])) {
#split every value by the comma, convert to lower case
list2char <- strsplit(tolower(train$jobTitle[i]),",", fixed = T)
for(j in 1:length(list2char[[1]])) {
#populate the current observation for the newly created column with 1
if(!(list2char[[1]][j] %in% names(train))){
#if the name does not match existing column name, create a new column and assign 1
train[i, str_trim(list2char[[1]][j])] <- 1
}else{
#if the name matches an existing column name, assign 1 to that column
}
}
}
}
#replace all NAs with 0s
train[is.na(train)] <- 0
答案 0 :(得分:0)
我认为你试图用逗号分隔的字符串计算每个变量的频率?
s<-data.frame(A=c("A1,B", "A2,C1"),B=c("B1,B2","C1,A1"), C=c("C1,C2,C3","C4"))
# A B C
#1 A1,B B1,B2 C1,C2,C3
#2 A2,C1 C1,A1 C4
table( unlist(apply(s,1, function(s.row) {
strsplit(s.row,",")
})) )
#A1 A2 B B1 B2 C1 C2 C3 C4
#2 1 1 1 1 3 1 1 1