Split a text string into columns based on variable

Asked: 2017-03-15 10:30:18

Tags: r regex text dataframe

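I have a data frame with a text column that I want to split into multiple columns, because each text string contains several variables, such as location, education, distance and so on.

Data frame: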

text.string = c("&location=NY&distance=30&education=University", 
                "&location=CA&distance=30&education=Highschool&education=University", 
                "&location=MN&distance=10&industry=Healthcare", 
                "&location=VT&distance=30&education=University&industry=IT&industry=Business") 

df = data.frame(text.string)
df


                                                                  text.string
1                               &location=NY&distance=30&education=University
2          &location=CA&distance=30&education=Highschool&education=University
3                                &location=MN&distance=10&industry=Healthcare
4 &location=VT&distance=30&education=University&industry=IT&industry=Business

I can split it with cSplit (from the splitstackshape package):

cSplit(df, 'text.string', sep = "&")

   text.string_1 text.string_2 text.string_3        text.string_4        text.string_5     text.string_6
1:            NA   location=NY   distance=30 education=University                   NA                NA
2:            NA   location=CA   distance=30 education=Highschool education=University                NA
3:            NA   location=MN   distance=10  industry=Healthcare                   NA                NA
4:            NA   location=VT   distance=30 education=University          industry=IT industry=Business

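The problem is that a text string may contain the same variable more than once, and some variables are missing from some strings. With cSplit, the variables end up mixed across columns. I want to avoid this and keep each variable grouped in its own column.

So the result would look something like this (education and industry no longer appear in more than one column):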

  text.string_1 text.string_2 text.string_3                             text.string_4                 text.string_5 text.string_6
1            NA   location=NY   distance=30                      education=University                          <NA>            NA
2            NA   location=CA   distance=30 education=Highschool education=University                          <NA>            NA
3            NA   location=MN   distance=10                                      <NA>           industry=Healthcare            NA
4            NA   location=VT   distance=30                      education=University  industry=IT industry=Business            NA

1 answer:

Answer 0: (score: 1)

Taking @NicE's comment into account, here is one way, following your example:

    library(data.table)

    text.string = c("&location=NY&distance=30&education=University",
                    "&location=CA&distance=30&education=Highschool&education=University",
                    "&location=MN&distance=10&industry=Healthcare",
                    "&location=VT&distance=30&education=University&industry=IT&industry=Business")

    # Split each string on "&" and "=" so that keys and values alternate
    clean <- strsplit(text.string, "&|=")

    out <- lapply(clean, function(x) {
      # Drop the empty piece left by the leading "&"; row 1 holds keys, row 2 holds values
      ma <- data.table(matrix(x[x != ""], nrow = 2, byrow = FALSE))
      setnames(ma, as.character(ma[1, ]))
      ma[-1, ]
    })

    # Stack the one-row tables, filling columns that a string lacks with NA
    out <- rbindlist(out, fill = TRUE)
    out
       location distance  education  education   industry industry
    1:       NY       30 University         NA         NA       NA
    2:       CA       30 Highschool University         NA       NA
    3:       MN       10         NA         NA Healthcare       NA
    4:       VT       30 University         NA         IT Business
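
The output above still repeats the education and industry columns, whereas the question asks for one column per variable. The following is a minimal base R sketch of one way to collapse repeated keys into a single cell. It is not part of the original answer; the helper name `parse_pairs` and the choice of a space as the separator between repeated values are assumptions made for illustration. It also assumes every piece between `&` characters has the form `key=value`.

```r
text.string <- c("&location=NY&distance=30&education=University",
                 "&location=CA&distance=30&education=Highschool&education=University",
                 "&location=MN&distance=10&industry=Healthcare",
                 "&location=VT&distance=30&education=University&industry=IT&industry=Business")

# Hypothetical helper: turn one string into a named character vector,
# with one element per distinct key (repeated keys are pasted together).
parse_pairs <- function(s) {
  parts <- strsplit(s, "&", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]          # drop the empty piece before the leading "&"
  keys  <- sub("=.*$", "", parts)        # text before the first "="
  vals  <- sub("^[^=]*=", "", parts)     # text after the first "="
  groups <- split(vals, factor(keys, levels = unique(keys)))
  vapply(groups, paste, character(1), collapse = " ")
}

parsed   <- lapply(text.string, parse_pairs)
all_keys <- unique(unlist(lapply(parsed, names)))

# Align every parsed string to the full set of keys, using NA where a key is absent
rows <- lapply(parsed, function(p) {
  row <- setNames(rep(NA_character_, length(all_keys)), all_keys)
  row[names(p)] <- p
  row
})

result <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
result
```

Under these assumptions each variable gets exactly one column, and a row with several values for the same key (such as the two education entries in the second string) holds them in one cell. If the values themselves can contain spaces, a different separator would be a safer choice.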