I have the following data:
id,response,date
123,{"showAgain":1421547783703,"answer":null,"details":null,"user_id":2423553}, 2015-01-11 02:23:03
124,{"showAgain":1421683620119,"answer":["Never"],"details":null,"user_id":4933822,"company_id":992211,"category":"apple"}, 2015-01-12 16:06:56
125,{"showAgain":1421692043509,"answer":["Sometimes","other"],"details":"I like bread.","user_id":2390922,"company_id":119988,"category":"banana"},2015-01-12 18:27:23
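To be clear, the "response" column value is what you see inside the curly braces.
I need to break the response out into new columns, but the strings do not always contain the same number of values. The desired output looks like this: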
id,answer,details,user_id,company_id,category,date
123,NA,NA,2423553,NA,NA,2015-01-11 02:23:03
124,Never,NA,4933822,992211,apple,2015-01-12 16:06:56
125,Other,"I like bread",2390922,119988,banana,2015-01-12 18:27:23
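NA could just as well be blank or NULL; I have no preference. In row 3, "answer" could also be a concatenation of the two responses, "Sometimes.Other". Alternatively, the second response could go into a new column named answer2. The incoming "answer" field will never contain more than 2 values (95% of the time it will be 1 value).
Any pointers on how to approach this are welcome.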
Answer 0 (score: 1)
Here's a start:
library(stringr)
library(dplyr)
library(jsonlite)
library(data.table)
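# Read the raw file; the first line is the header
lines <- readLines("data.txt")

# x is the 1-row match matrix for one line: x[2] = id, x[3] = JSON, x[4] = date
build_cols <- function(x) {
  data.frame(cbind(id=x[2], date=x[4], rbind(fromJSON(x[3]))))
}

# Split each data line into id, JSON, and date, expand the JSON into columns,
# and stack the rows; fill=TRUE pads keys that are absent from a row
rbindlist(lapply(str_match_all(lines[2:length(lines)],
                               "([[:digit:]]+),(\\{.*\\}),(.*$)"),
                 build_cols), fill=TRUE) %>%
  select(id,answer,details,user_id,company_id,category,date)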
##     id          answer       details user_id company_id category                date
## 1: 123            NULL          NULL 2423553       NULL     NULL 2015-01-11 02:23:03
## 2: 124           Never          NULL 4933822     992211    apple 2015-01-12 16:06:56
## 3: 125 Sometimes,other I like bread. 2390922     119988   banana 2015-01-12 18:27:23
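If you would rather get the answer2 column mentioned in the question, something along these lines might work. It is only a rough sketch using base R plus jsonlite, not a tested solution for the full file: `field` and `parse_line` are helper names I made up, the three sample rows are pasted inline instead of being read from data.txt, and it assumes "answer" never holds more than two values, as the question states. Missing or null keys come out as NA, and whitespace before the date is dropped.

```r
library(jsonlite)

# Sample rows copied from the question (header line omitted)
lines <- c(
  '123,{"showAgain":1421547783703,"answer":null,"details":null,"user_id":2423553}, 2015-01-11 02:23:03',
  '124,{"showAgain":1421683620119,"answer":["Never"],"details":null,"user_id":4933822,"company_id":992211,"category":"apple"}, 2015-01-12 16:06:56',
  '125,{"showAgain":1421692043509,"answer":["Sometimes","other"],"details":"I like bread.","user_id":2390922,"company_id":119988,"category":"banana"},2015-01-12 18:27:23'
)

# Hypothetical helper: return x[[name]] as a single string,
# or NA when the key is absent or its value is null
field <- function(x, name) {
  v <- x[[name]]
  if (is.null(v) || length(v) == 0) NA_character_ else as.character(v[[1]])
}

# Hypothetical helper: turn one raw line into a one-row data frame
parse_line <- function(line) {
  # m[1] = whole match, m[2] = id, m[3] = JSON text, m[4] = date
  m <- regmatches(line,
                  regexec("^([0-9]+),(\\{.*\\}),[[:space:]]*(.*)$", line))[[1]]
  js <- fromJSON(m[3])
  ans <- js[["answer"]]  # NULL, or a character vector of length 1 or 2
  data.frame(
    id         = m[2],
    answer     = if (length(ans) >= 1) as.character(ans[1]) else NA_character_,
    answer2    = if (length(ans) >= 2) as.character(ans[2]) else NA_character_,
    details    = field(js, "details"),
    user_id    = field(js, "user_id"),
    company_id = field(js, "company_id"),
    category   = field(js, "category"),
    date       = m[4],
    stringsAsFactors = FALSE
  )
}

# Stack the one-row data frames into the final table
out <- do.call(rbind, lapply(lines, parse_line))
print(out)
```

If a single concatenated column is preferred, replacing the two answer lines with `paste(ans, collapse = ".")` (guarded for the NULL case) should give "Sometimes.other" for the third row. All columns come back as character here, so convert user_id and company_id with `as.integer()` if you need numbers.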