Parsing a complex string column into new columns in R

Asked: 2015-01-15 20:46:38

Tags: r parsing

I have the following data:

id,response,date
123,{"showAgain":1421547783703,"answer":null,"details":null,"user_id":2423553}, 2015-01-11 02:23:03
124,{"showAgain":1421683620119,"answer":["Never"],"details":null,"user_id":4933822,"company_id":992211,"category":"apple"}, 2015-01-12 16:06:56
125,{"showAgain":1421692043509,"answer":["Sometimes","other"],"details":"I like bread.","user_id":2390922,"company_id":119988,"category":"banana"},2015-01-12 18:27:23

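To be clear, the value of the "response" column is what you see inside the curly braces.

I need to break the response out into new columns, but the strings do not always contain the same number of values. The desired output looks like this: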

id,answer,details,user_id,company_id,category,date
123,NA,NA,2423553,NA,NA,2015-01-11 02:23:03
124,Never,NA,4933822,992211,apple,2015-01-12 16:06:56
125,Other,"I like bread",2390922,119988,banana,2015-01-12 18:27:23

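The NA could just as well be blank or NULL; I have no preference. In row 3, "answer" could also be a concatenation of the two responses, "Sometimes.other". Alternatively, the second value could be broken out into a new column named answer2. The incoming "answer" field will never contain more than 2 values (95% of the time it will be 1 value).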
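Both options for a two-valued "answer" field can be sketched in base R. This is only an illustration: the helper names `collapse_answer` and `split_answer` are hypothetical, and the sketch assumes the field has already been parsed into a character vector (or NULL when the JSON value is null).

```r
# Hypothetical helpers, not part of the original data or any package.

# Option 1: join all values into one string, e.g. "Sometimes.other".
collapse_answer <- function(ans, sep = ".") {
  if (is.null(ans) || length(ans) == 0) return(NA_character_)
  paste(ans, collapse = sep)
}

# Option 2: spread up to two values over answer and answer2.
# Indexing past the end of a character vector yields NA,
# so missing values are filled in automatically.
split_answer <- function(ans) {
  ans <- if (is.null(ans)) character(0) else as.character(ans)
  c(answer = ans[1], answer2 = ans[2])
}

collapse_answer(c("Sometimes", "other"))
split_answer("Never")
```

Since the field never holds more than 2 values, two fixed columns are enough for the second option.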

Any pointers on how to approach this are welcome.

1 Answer:

Answer 0 (score: 1)

Here's a start:

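library(stringr)
library(dplyr)
library(jsonlite)
library(data.table)

# Read the raw file; the first line is the header.
lines <- readLines("data.txt")

# x is one match matrix from str_match_all():
# x[1] = whole line, x[2] = id, x[3] = JSON string, x[4] = date.
build_cols <- function(x) {
  data.frame(cbind(id=x[2], date=x[4], rbind(fromJSON(x[3]))))
}

# Skip the header, split each line into id / JSON / date, parse the JSON,
# and stack the rows; fill=TRUE pads keys that are missing from a row.
rbindlist(lapply(str_match_all(lines[-1], 
                               "([[:digit:]]+),(\\{.*\\}),(.*$)"),
                 build_cols), fill=TRUE) %>%
  select(id,answer,details,user_id,company_id,category,date)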

##     id          answer       details user_id company_id category                 date
## 1: 123            NULL          NULL 2423553       NULL     NULL  2015-01-11 02:23:03
## 2: 124           Never          NULL 4933822     992211    apple  2015-01-12 16:06:56
## 3: 125 Sometimes,other I like bread. 2390922     119988   banana  2015-01-12 18:27:23
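
The result above still holds NULL entries and a multi-valued "answer". One possible follow-up, sketched here under assumptions rather than taken from the original answer, is to flatten every field to a single string and use NA for anything missing. The helper names `flatten_field` and `parse_line` are hypothetical, the sample rows are copied inline from the question, and a comma is assumed as the separator for multiple answers.

```r
library(jsonlite)

# Sample rows copied from the question (header line omitted).
lines <- c(
  '123,{"showAgain":1421547783703,"answer":null,"details":null,"user_id":2423553}, 2015-01-11 02:23:03',
  '124,{"showAgain":1421683620119,"answer":["Never"],"details":null,"user_id":4933822,"company_id":992211,"category":"apple"}, 2015-01-12 16:06:56',
  '125,{"showAgain":1421692043509,"answer":["Sometimes","other"],"details":"I like bread.","user_id":2390922,"company_id":119988,"category":"banana"},2015-01-12 18:27:23'
)

# Columns wanted from the JSON payload.
fields <- c("answer", "details", "user_id", "company_id", "category")

# Return one field as a single string, or NA when it is null or absent.
flatten_field <- function(parsed, name) {
  value <- parsed[[name]]
  if (is.null(value) || length(value) == 0) return(NA_character_)
  paste(value, collapse = ",")
}

# Turn one raw line into a one-row data frame of character columns.
parse_line <- function(line) {
  # Same idea as the pattern above: id, the braces, then the date.
  m <- regmatches(line, regexec("^([[:digit:]]+),(\\{.*\\}),(.*)$", line))[[1]]
  parsed <- fromJSON(m[3])
  values <- vapply(fields, function(f) flatten_field(parsed, f), character(1))
  data.frame(id = m[2], as.list(values), date = trimws(m[4]),
             stringsAsFactors = FALSE)
}

result <- do.call(rbind, lapply(lines, parse_line))
print(result)
```

All columns come back as character here, so `user_id` and `company_id` would need `as.integer()` and `date` would need `as.POSIXct()` if typed columns are required.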