R: How do I compare strings against a list of existing column names?

Asked: 2016-12-15 22:32:08

Tags: r parsing compare

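I need to write R code that does the following:

  • Uses a loop
  • Goes through a column
  • Splits every value on commas and assigns the pieces to a variable
  • Compares the values in that variable with the existing column names
  • If a column name does not exist, creates a new column, one for each comma-separated value
  • Puts a '1' into the current observation of the new column
  • If the column name exists, puts a '1' into the current observation of the existing column with that name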

The data (column) before the operation looks like this:

                                     jobTitle
1                                        <NA>
2                                        <NA>
3                                        <NA>
4   Functional Architect, Business Technology
5                                        <NA>
6                                        <NA>
7                                        <NA>
8                                        <NA>
9                                        <NA>
10                      Founder and President
11                            Product Manager
12                                       <NA>
13                                       <NA>
14                                       <NA>
15 Head of Customer Experience & Online Sales
16                                       <NA>
17                                       <NA>
18                      Founder and President
19                                       <NA>
20                                       <NA>
21                            Product Manager
22                                       <NA>
23                     Customer Value Manager
24                                       <NA>
25                    Lead Software Developer
  ...

The output I need is:

Founder and President  Product Manager
       0                       1        
       1                       0      
       0                       1
       1                       0

The output I get is:

Founder and President  Product Manager  Founder and President  Product Manager
       0                       1                   0                 0      
       1                       0                   0                 0     
       0                       0                   1                 0      
       0                       0                   0                 1

My code is:

library(plyr)
library(stringr)
library(gdata) 
library(readxl)

train <- read_excel("data.xlsx")

#looping through the jobTitle column (column 4 of train)
for(i in 1:nrow(train)){
    if (!is.na(train[i,4])) {
        #split every value by the comma, convert to lower case
        list2char <- strsplit(tolower(train$jobTitle[i]), ",", fixed = TRUE)
        for(j in 1:length(list2char[[1]])) {
            #populate the current observation for the newly created column with 1
            if(!(list2char[[1]][j] %in% names(train))){
                #if the name does not match existing column name, create a new column and assign 1
                train[i, str_trim(list2char[[1]][j])] <- 1
            }else{
                #if the name matches an existing column name, assign 1 to that column
                #(this is the part I don't know how to write)
            }
        }
    }
}

#replace all NAs with 0s
train[is.na(train)] <- 0
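
A minimal sketch of the loop described above, with the name trimmed before it is compared against `names(train)` and with the existing-column case handled. The toy `train` data frame is an assumption standing in for the spreadsheet, and titles are lower-cased as in the code above, so the new columns get lower-case names. It is one possible way to fill in the missing branch, not necessarily the cause of the duplicated columns shown earlier.

```r
# Toy stand-in for the spreadsheet (assumption); the real data comes from read_excel()
train <- data.frame(
  jobTitle = c(NA, "Founder and President", "Product Manager",
               "Founder and President",
               "Functional Architect, Business Technology"),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(train))) {
  if (is.na(train$jobTitle[i])) next

  # split on commas, then lower-case and trim BEFORE comparing with names(train)
  parts <- trimws(tolower(strsplit(train$jobTitle[i], ",", fixed = TRUE)[[1]]))

  for (p in parts) {
    if (!(p %in% names(train))) {
      train[[p]] <- 0   # new column, zero for every row
    }
    train[i, p] <- 1    # mark the current row, whether the column is new or not
  }
}

train[, -1]   # indicator columns only
```

Because each new column is created with zeros, the final NA-to-0 replacement is not needed for these columns.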

1 Answer:

Answer 0: (score: 0)

I think you are trying to count how often each value occurs in the comma-separated strings?

    s<-data.frame(A=c("A1,B", "A2,C1"),B=c("B1,B2","C1,A1"), C=c("C1,C2,C3","C4"))
    #      A     B        C
    #1  A1,B B1,B2 C1,C2,C3
    #2 A2,C1 C1,A1       C4

    table( unlist(apply(s,1, function(s.row) {
       strsplit(s.row,",")
    })) )

    #A1 A2  B B1 B2 C1 C2 C3 C4 
    #2  1  1  1  1  3  1  1  1
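
If one 0/1 indicator per row is wanted rather than overall counts, the same `strsplit()` plus `table()` idea can be applied per row. The following is a sketch under assumptions: the `jobTitle` vector is made-up sample data, and `pieces`, `row_id`, `title`, `tab` and `m` are illustrative names.

```r
# Made-up sample of the jobTitle column (assumption)
jobTitle <- c(NA, "Founder and President", "Product Manager",
              "Founder and President",
              "Functional Architect, Business Technology")

# One character vector of trimmed pieces per row; NA rows stay NA
pieces <- lapply(strsplit(jobTitle, ",", fixed = TRUE), trimws)

# Long form: the row each piece came from, and the piece itself
row_id <- rep(seq_along(pieces), lengths(pieces))
title  <- unlist(pieces)

# Cross-tabulate row against title; table() drops NA titles by default,
# and the factor levels keep rows without a title as all-zero rows
tab <- table(factor(row_id, levels = seq_along(pieces)), title)

# Plain matrix, capped at 1 in case a title repeats within one row
m <- unclass(tab)
m[m > 1] <- 1
m
```

The result is a matrix with one row per observation and one column per distinct title, which avoids growing the data frame inside a loop.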