How can I convert this unstructured data into structured data?

Time: 2017-04-04 00:45:07

Tags: r

I have data that looks like this:

data <- c("24-March-2017       text1                         874874455221112                Text text text10",
  "25-March-2017       text2                          54654656TEXT                  Text text 11",
  "24-March-2017       text3                          874874455221112               Text text text 12",
  "25-March-2017                  text4                         54654656TEXT                    Text text  13",
  "26-March-2017     text3              54654TEXT   Text text text  14",
  "27-March-2017                text5                       6546TEXT    Text text text 15",
  "28-March-2017      text6                          546476876586TExt   Text text text 16",
  "29-March-2017                  text7      23453453TEXT     Text text  17") 

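I want to convert this data into a structured, fixed-width format based on the whitespace between the columns. The first three rows already look exactly the way I want the data to look. In the final result:

  1. The first column (the date) starts at position zero (no change needed)
  2. The second column must start at position 20
  3. The third column starts at position 50
  4. The last column starts at position 80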

2 Answers:

Answer 0 (score: 3)

do.call('rbind', lapply( df, function( x ) {  # loop over each string in the vector df (see "Data" below)
  x <- strsplit( x, " ", fixed = TRUE )[[1]]  # split the string on single spaces
  x <- x[ nchar( x ) > 0 ]                    # drop the empty strings left by repeated spaces
  x <- c( x[1:3], paste( x[4:length(x)], collapse = " ") )  # keep the first three fields, rejoin the rest
  return( x )  # return the four-element vector
}))

#                 [,1]    [,2]             [,3]                 [,4]               
# [1,] "24-March-2017" "text1" "874874455221112"  "Text text text10" 
# [2,] "25-March-2017" "text2" "54654656TEXT"     "Text text 11"     
# [3,] "24-March-2017" "text3" "874874455221112"  "Text text text 12"
# [4,] "25-March-2017" "text4" "54654656TEXT"     "Text text 13"     
# [5,] "26-March-2017" "text3" "54654TEXT"        "Text text text 14"
# [6,] "27-March-2017" "text5" "6546TEXT"         "Text text text 15"
# [7,] "28-March-2017" "text6" "546476876586TExt" "Text text text 16"
# [8,] "29-March-2017" "text7" "23453453TEXT"     "Text text 17"  

Based on @thelatemail's comment:

df <- read.table(text=df,fill=TRUE,header=FALSE)
df[, 4] <- apply( df[, 4:ncol(df)], 1, function( x ) {
  paste( x[ ! is.na( x ) ], collapse = ' ') } )
df <- df[, 1:4]
df
#              V1    V2               V3                V4
# 1 24-March-2017 text1  874874455221112  Text text text10
# 2 25-March-2017 text2     54654656TEXT      Text text 11
# 3 24-March-2017 text3  874874455221112 Text text text 12
# 4 25-March-2017 text4     54654656TEXT      Text text 13
# 5 26-March-2017 text3        54654TEXT Text text text 14
# 6 27-March-2017 text5         6546TEXT Text text text 15
# 7 28-March-2017 text6 546476876586TExt Text text text 16
# 8 29-March-2017 text7     23453453TEXT      Text text 17
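
One thing to keep in mind with this approach: read.table sizes the table from the first five lines of input, so it works here because a seven-field line appears among them. The sketch below is a hedged variation, not part of the original answer, that counts the fields up front with count.fields and passes explicit col.names. The input vector `lines` and the names `con` and `n` are made up for illustration.

```r
# Hypothetical input: the widest line (6 fields) only appears on line 6
lines <- c("a b c d", "a b c d", "a b c d", "a b c d", "a b c d", "a b c d e f")

# Count whitespace-separated fields on every line, not just the first five
con <- textConnection(lines)
n <- max(count.fields(con))
close(con)

# Supplying col.names fixes the number of columns explicitly
df <- read.table(text = lines, fill = TRUE, header = FALSE,
                 col.names = paste0("V", seq_len(n)))
dim(df)
```

With the column count fixed this way, the remaining steps (pasting columns 4 onward together and keeping the first four) can stay as they are.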

Data:

df <- c("24-March-2017       text1                         874874455221112                Text text text10",
          "25-March-2017       text2                          54654656TEXT                  Text text 11",
          "24-March-2017       text3                          874874455221112               Text text text 12",
          "25-March-2017                  text4                         54654656TEXT                    Text text  13",
          "26-March-2017     text3              54654TEXT   Text text text  14",
          "27-March-2017                text5                       6546TEXT    Text text text 15",
          "28-March-2017      text6                          546476876586TExt   Text text text 16",
          "29-March-2017                  text7      23453453TEXT     Text text  17") 

Answer 1 (score: 3)

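This is based on the given data and assumes that:

  • there are four columns
  • the first three contain no internal spaces and are separated by whitespace
  • the last column may contain spaces

It pulls out the matched substrings, rbinds them into a matrix, drops the full-match column, converts the result to a data.frame, and then passes it through sprintf to get fixed-width output.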

library(magrittr)  # provides the %>% pipe

data %>%
  regmatches(regexec("^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*?)$", .)) %>%
  do.call("rbind", .) %>%
  .[, -1] %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  c(list("%-20s%-30s%-30s%s"), .) %>%
  do.call("sprintf", .)

# [1] "24-March-2017       text1                         874874455221112               Text text text10"  
# [2] "25-March-2017       text2                         54654656TEXT                  Text text 11"      
# [3] "24-March-2017       text3                         874874455221112               Text text text 12" 
# [4] "25-March-2017       text4                         54654656TEXT                  Text text  13"     
# [5] "26-March-2017       text3                         54654TEXT                     Text text text  14"
# [6] "27-March-2017       text5                         6546TEXT                      Text text text 15" 
# [7] "28-March-2017       text6                         546476876586TExt              Text text text 16" 
# [8] "29-March-2017       text7                         23453453TEXT                  Text text  17"
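
For readers who prefer to see the intermediate values, the same steps can be sketched in base R without pipes. This is an illustrative rewrite under the same assumptions (four columns, no spaces inside the first three), using two rows from the question's data. The names `pattern`, `matches`, `fields`, `fixed`, and `starts` are introduced here and are not part of the original answer.

```r
# Sample input: two rows taken from the question's data
data <- c(
  "25-March-2017                  text4                         54654656TEXT                    Text text  13",
  "26-March-2017     text3              54654TEXT   Text text text  14"
)

# Step 1: capture three whitespace-free fields, then the rest of the line
pattern <- "^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)$"
matches <- regmatches(data, regexec(pattern, data))

# Step 2: bind the per-line matches into a matrix and drop the full match
fields <- do.call(rbind, matches)[, -1, drop = FALSE]

# Step 3: left-justify the first three fields to widths 20, 30, and 30
fixed <- sprintf("%-20s%-30s%-30s%s",
                 fields[, 1], fields[, 2], fields[, 3], fields[, 4])

# Check: R strings are 1-based, so the question's positions 0, 20, 50, 80
# correspond to 1, 21, 51, 81 here
starts <- c(1, 21, 51, 81)
substring(fixed[1], starts, starts + 3)
```

The final call returns the first four characters of each column in the first row, which is a quick way to confirm that every column begins at the requested position.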