Suggestion for reading JSON inside a CSV file without parsing it in R

Time: 2017-05-14 16:49:07

Tags: r

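I have a CSV file that also contains JSON. I am trying to get Company, Type, and Driver into a data frame. I don't want to parse it by hand, because the CSV below is only a sample; I have many more columns with all sorts of JSON keys/values (some absent, in no particular order, and very many of them).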

A sample of my CSV file, sub-short_csvjson.csv:

Married,Transportation,Color
YES,"{""Company"":""GTS"",""Type"":""Limo""}",White
,"{""Driver"":""John""}",Green
NO,"{""Type"":""Van"",""Driver"":""John""}",

What can I use (other than parsing) to create a data frame with the following columns?
my_data$Married
my_data$Transportation.Company
my_data$Transportation.Type
my_data$Transportation.Driver
my_data$Color

Thanks

1 answer:

Answer 0 (score: 1)

Here is one solution I could come up with; it uses the jsonlite package and row-by-row processing to give you what you need:

Assume df, read with read.csv and stringsAsFactors = FALSE, looks like this:

df
  Married                  Transportation Color
1     YES {"Company":"GTS","Type":"Limo"} White
2                       {"Driver":"John"} Green
3      NO  {"Type":"Van","Driver":"John"}      
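
As a minimal sketch of the read step assumed above: the sample is inlined here through the text argument so the snippet is self-contained, and the name csv_text is only illustrative. With a real file you would pass the file path to read.csv instead.

```r
# Inline copy of the sample CSV from the question (csv_text is an illustrative name).
csv_text <- 'Married,Transportation,Color
YES,"{""Company"":""GTS"",""Type"":""Limo""}",White
,"{""Driver"":""John""}",Green
NO,"{""Type"":""Van"",""Driver"":""John""}",'

# read.csv un-doubles the "" quotes, so each Transportation cell is plain JSON text.
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect the column types and the raw JSON strings.
str(df)
```

Keeping stringsAsFactors = FALSE matters on R versions before 4.0, where the JSON column would otherwise become a factor.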

You can do this:

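library(jsonlite)

# Parse each JSON string into a named list
l <- lapply(df$Transportation, fromJSON)
# Collect every key that appears in any row
n <- unique(unlist(sapply(l, names)))
# Add one column per key; rows that lack a key get NULL
df[, n] <- lapply(n, function(x) sapply(l, function(y) y[[x]]))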

To get this:

df
  Married                  Transportation Color Company Type Driver
1     YES {"Company":"GTS","Type":"Limo"} White     GTS Limo   NULL
2                       {"Driver":"John"} Green    NULL NULL   John
3      NO  {"Type":"Van","Driver":"John"}          NULL  Van   John

I'm not sure whether there is a more efficient way.
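
If the column names asked for in the question (Transportation.Company and so on) and NA rather than NULL for missing keys are preferred, one possible variant is sketched below. It assumes every JSON object is flat, with a single string value per key; the names parsed and keys are illustrative, and the sample data frame is rebuilt inline so the snippet runs on its own.

```r
library(jsonlite)

# Sample data mirroring the question's CSV.
df <- data.frame(
  Married = c("YES", "", "NO"),
  Transportation = c('{"Company":"GTS","Type":"Limo"}',
                     '{"Driver":"John"}',
                     '{"Type":"Van","Driver":"John"}'),
  Color = c("White", "Green", ""),
  stringsAsFactors = FALSE
)

# Parse each cell into a named list.
parsed <- lapply(df$Transportation, fromJSON)

# Collect every key seen in any row, in order of first appearance.
keys <- unique(unlist(lapply(parsed, names)))

# One character column per key, prefixed with the source column name;
# rows that lack the key get NA (assumes one scalar value per key).
for (k in keys) {
  df[[paste0("Transportation.", k)]] <- vapply(
    parsed,
    function(p) if (is.null(p[[k]])) NA_character_ else as.character(p[[k]]),
    character(1)
  )
}
```

Because vapply enforces a single character value per row, the new columns are ordinary character vectors rather than list columns, which tends to be easier to filter and export.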

Edit, based on the added information about MALFORMED JSON in the actual data

If the Transportation column in the original data contains malformed JSON, you can work around it as follows:

The original data frame is:

df <- read.table(text = 'Married,Transportation,Color
YES,"{""Company"":""GTS"",""Type"":""Limo""}",White
,"{""Driver"":""John""}",Green
NO,"{""Type"":""Van"",""Driver"":""John""}",',
header = TRUE, sep = ',', stringsAsFactors = FALSE)

Row-bind an extra row whose JSON is malformed by an extra " character:

df <- rbind(df, data.frame(Married = 'NO',
                           Transportation = '{"Company": ""GTLS"}',
                           Color = 'Red'))

The new df looks like this (note the malformed JSON in row 4):

  Married                  Transportation Color
1     YES {"Company":"GTS","Type":"Limo"} White
2                       {"Driver":"John"} Green
3      NO  {"Type":"Van","Driver":"John"}      
4      NO            {"Company": ""GTLS"}   Red

Now use this to split all of the nested JSON into separate columns:

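# Parse each JSON string; malformed JSON yields NA instead of an error
l <- lapply(df$Transportation, function(x) tryCatch({fromJSON(x)}, error = function(e) NA))
# Collect every key that appears in any successfully parsed row
n <- unique(unlist(sapply(l, names)))
# Add one column per key; unparsed rows and missing keys get NULL
df[, n] <- lapply(n, function(x)
                    sapply(l, function(y)
                             if (!is.null(names(y))) y[[x]]))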

The output is:

  Married                  Transportation Color Company Type Driver
1     YES {"Company":"GTS","Type":"Limo"} White     GTS Limo   NULL
2                       {"Driver":"John"} Green    NULL NULL   John
3      NO  {"Type":"Van","Driver":"John"}          NULL  Van   John
4      NO            {"Company": ""GTLS"}   Red    NULL NULL   NULL
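
Since row 4 ends up looking the same as a row whose JSON was simply empty, it may help to record which rows failed to parse. The sketch below is one way to do that, not part of the original answer: safe_parse and json_ok are hypothetical names, and it again assumes flat objects with one string value per key.

```r
library(jsonlite)

# Hypothetical helper: parse one JSON string, returning NULL on failure.
safe_parse <- function(txt) {
  tryCatch(fromJSON(txt), error = function(e) NULL)
}

df <- data.frame(
  Transportation = c('{"Company":"GTS","Type":"Limo"}',
                     '{"Company": ""GTLS"}'),  # second row is malformed
  stringsAsFactors = FALSE
)

parsed <- lapply(df$Transportation, safe_parse)

# Flag rows whose JSON could not be parsed so they can be inspected later.
df$json_ok <- !vapply(parsed, is.null, logical(1))

# One character column per key; unparsed rows and missing keys get NA.
keys <- unique(unlist(lapply(parsed, names)))
for (k in keys) {
  df[[k]] <- vapply(
    parsed,
    function(p) if (is.null(p[[k]])) NA_character_ else as.character(p[[k]]),
    character(1)
  )
}
```

Filtering with df[!df$json_ok, ] then lists the rows that need manual repair.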