Question

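EDIT 2014-05-01: I first tried treating the file as JSON (as shown below), but only the first line was parsed. I found out that there needs to be a comma between the braces of consecutive JSON lines, so I changed that in TextEdit and saved the file. I also added [ at the beginning of the file and ] at the end, and then it was read as JSON. Now on to the next step: going from a list (with embedded lists) to a data frame (or csv).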

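Every now and then I get a data package from edX for the courses we are evaluating. Some of these are just plain .csv files, which are easy to handle; others are harder for me (I have no CS or programming background).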

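I have 2 files that I want to open and parse into csv files for analysis in R. I have tried a lot of json2csv tools, but to no avail. I also tried the simple method described here for converting json to csv.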

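The data is confidential, so I can't share the whole dataset, but I will share the first few lines of a file, which may help. The thing is, I can't find anything about .mongo files, which seems odd to me. Do they even exist? Or is this just a JSON file that is possibly corrupted (which would explain the errors)?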

Any suggestions are welcome.

The first few lines of one of the .mongo files:

{
    "_id": {
        "$oid": "52d1e62c350e7a3156000009"
    },
    "votes": {
        "up": [

        ],
        "down": [

        ],
        "up_count": 0,
        "down_count": 0,
        "count": 0,
        "point": 0
    },
    "visible": true,
    "abuse_flaggers": [

    ],
    "historical_abuse_flaggers": [

    ],
    "parent_ids": [

    ],
    "at_position_list": [

    ],
    "body": "the delft university accredited course with the scholarship (fundamentals of water treatment) is supposed to start in about a month's time. But have the scholarship list been published? Any tentative date??",
    "course_id": "DelftX/CTB3365x/2013_Fall",
    "_type": "Comment",
    "endorsed": false,
    "anonymous": false,
    "anonymous_to_peers": false,
    "author_id": "269835",
    "comment_thread_id": {
        "$oid": "52cd40c5ab40cf347e00008d"
    },
    "author_username": "tachak59",
    "sk": "52d1e62c350e7a3156000009",
    "updated_at": {
        "$date": 1389487660636
    },
    "created_at": {
        "$date": 1389487660636
    }
}{
    "_id": {
        "$oid": "52d0a66bcb3eee318d000012"
    },
    "votes": {
        "up": [

        ],
        "down": [

        ],
        "up_count": 0,
        "down_count": 0,
        "count": 0,
        "point": 0
    },
    "visible": true,
    "abuse_flaggers": [

    ],
    "historical_abuse_flaggers": [

    ],
    "parent_ids": [
        {
            "$oid": "52c63278100c07c0d1000028"
        }
    ],
    "at_position_list": [

    ],
    "body": "I got it. Thank you!",
    "course_id": "DelftX/CTB3365x/2013_Fall",
    "_type": "Comment",
    "endorsed": false,
    "anonymous": false,
    "anonymous_to_peers": false,
    "parent_id": {
        "$oid": "52c63278100c07c0d1000028"
    },
    "author_id": "2655027",
    "comment_thread_id": {
        "$oid": "52c4f303b03c4aba51000013"
    },
    "author_username": "dmoronta",
    "sk": "52c63278100c07c0d1000028-52d0a66bcb3eee318d000012",
    "updated_at": {
        "$date": 1389405803386
    },
    "created_at": {
        "$date": 1389405803386
    }
}{
    "_id": {
        "$oid": "52ceea0cada002b72c000059"
    },
    "votes": {
        "up": [

        ],
        "down": [

        ],
        "up_count": 0,
        "down_count": 0,
        "count": 0,
        "point": 0
    },
    "visible": true,
    "abuse_flaggers": [

    ],
    "historical_abuse_flaggers": [

    ],
    "parent_ids": [
        {
            "$oid": "5287e8d5906c42f5aa000013"
        }
    ],
    "at_position_list": [

    ],
    "body": "if u please send by mail \n",
    "course_id": "DelftX/CTB3365x/2013_Fall",
    "_type": "Comment",
    "endorsed": false,
    "anonymous": false,
    "anonymous_to_peers": false,
    "parent_id": {
        "$oid": "5287e8d5906c42f5aa000013"
    },
    "author_id": "2276302",
    "comment_thread_id": {
        "$oid": "528674d784179607d0000011"
    },
    "author_username": "totah1993",
    "sk": "5287e8d5906c42f5aa000013-52ceea0cada002b72c000059",
    "updated_at": {
        "$date": 1389292044203
    },
    "created_at": {
        "$date": 1389292044203
    }
}

Answer 1

R has no "native" support for these files, but the rjson package provides a JSON parser. So you could load your .mongo file with:

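library(rjson)

myfile <- "path/to/myfile.mongo"
# fromJSON() expects a single string, so collapse the lines first
myJSON <- paste(readLines(myfile), collapse = "")
myNiceData <- fromJSON(myJSON)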

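Since rjson converts the input into a data structure that fits the object being read, you will have to do some extra snooping, but once you have an R data type you shouldn't have any trouble working with it from there.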
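As a rough illustration of that "snooping", the sketch below parses a single, trimmed-down record modelled on the sample in the question and inspects the nested list that rjson returns. The record string is an assumption made for the example; real records carry many more fields.

```r
library(rjson)

# One trimmed-down record (assumed for illustration, based on the sample above)
rec <- fromJSON('{"_id":{"$oid":"52d0a66bcb3eee318d000012"},"votes":{"up":[],"up_count":0},"body":"I got it. Thank you!"}')

# Show the structure: nested JSON objects come back as nested lists
str(rec, max.level = 2)

# Fields nested inside objects are reached level by level
up_count <- rec$votes$up_count

# Names such as "_id" or "$oid" are not syntactic R names, so use [[ ]] or unlist()
id <- unlist(rec[["_id"]])[[1]]
```

Running str() on one record first makes it easier to see which fields are plain values and which are lists that need unlist() before they can go into a table.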

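Another package to consider when parsing JSON data is jsonlite. It will create data frames for you, so you can write them out in csv format with write.table or some other method suited to writing objects.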
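A minimal sketch of the jsonlite route follows. It assumes that each record in the real file sits on a single line, as the question's wording suggests; in that case jsonlite's stream_in() can read the file without adding commas or brackets first. The two records and the temporary file names are made up for the example.

```r
library(jsonlite)

# Two made-up records, one JSON object per line (an assumption about the file layout)
lines <- c(
  '{"_id":{"$oid":"a1"},"votes":{"up":[],"up_count":0},"body":"first\\n","author_id":"1","created_at":{"$date":1389487660636}}',
  '{"_id":{"$oid":"b2"},"votes":{"up":[],"up_count":2},"body":"second","author_id":"2","parent_id":{"$oid":"a1"},"created_at":{"$date":1389405803386}}'
)
path <- tempfile(fileext = ".mongo")
writeLines(lines, path)

# Read line-delimited JSON into a data frame, then flatten nested objects
# into columns such as "votes.up_count" and "created_at.$date"
records <- stream_in(file(path), verbose = FALSE)
flat <- flatten(records)

# Convert the millisecond timestamp into a date-time
flat$created <- as.POSIXct(flat[["created_at.$date"]] / 1000,
                           origin = "1970-01-01", tz = "UTC")

# Keep only plain vector columns; list columns (e.g. empty arrays) cannot go into a csv
keep <- vapply(flat, function(col) is.atomic(col) && is.null(dim(col)), logical(1))
atomic <- flat[, keep, drop = FALSE]

out <- tempfile(fileext = ".csv")
write.csv(atomic, out, row.names = FALSE)
```

If the file instead holds records that span several lines, stream_in() will not apply and the file has to be repaired into a JSON array before parsing.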

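Note: if it is easier to connect to the MongoDB and get the data from a request, then RMongo may be a good bet. R-Bloggers also has a post about using RMongo that includes a nice little walkthrough.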

Answer 2

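I used rjson as suggested by @theWanderer and, with the help of a colleague, wrote the following code to parse the data into columns, select the specific columns that are needed, and check that every instance returns the right variables.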

The whole workflow:

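Checked some of the data in jsonlint and corrected the errors: },{ instead of }{ between each line, and [ and ] at the beginning and end of the file
Made a smaller file containing about 11 JSON lines
Used the code below to parse the data file. First, however, check whether the different listItems are lists themselves (that causes problems). As you will see, I also removed things like \n because they gave errors, and added an empty value for parent_id when there is none in the data (otherwise the data gets mixed up)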
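The manual repair in the first step can also be scripted. The following is a minimal sketch in base R, assuming the only problems are the missing commas between records and the missing surrounding brackets; the input string is made up for the example. Note the caveat that a message body containing a literal }{ would be altered as well.

```r
# Made-up input: three records glued together the way the .mongo file is
raw <- '{"a":1}{"a":2}\n{"a":3}'

# Put a comma between a closing and the next opening brace
# (ignoring any whitespace or line breaks in between) ...
with_commas <- gsub("\\}\\s*\\{", "},{", raw, perl = TRUE)

# ... and wrap everything in [ ] so the whole text is one JSON array
fixed <- paste0("[", with_commas, "]")
```

For a real file, raw could be built with paste(readLines(path), collapse = "\n"), and fixed could then be written back with writeLines() or passed to a JSON parser.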

The code to import the .mongo file into R and then parse it to CSV:

library(rjson)

###### set working directory to write out the data file
setwd("/your/favourite/dir/json to csv/")

#never ever convert strings to factors
options(stringsAsFactors = FALSE)
#import the .mongo file to R
temp.data = fromJSON(file="temp.mongo", method="C", unexpected.escape="error")

## remove the old data file if there is one, so that a new file is
## created instead of appending to the previous output
if (file.exists("temp.csv")) file.remove("temp.csv")

for (listItem in temp.data){
  ## use an empty value when a record has no parent_id,
  ## otherwise the columns get mixed up
  parent_id = ""
  if (length(listItem$parent_id)>0){
    parent_id = unlist(listItem$parent_id)
  }
  write.table(t(c(
    listItem$votes$up_count, listItem$visible, parent_id,
    gsub("\n", "", listItem$body), listItem$course_id, unlist(listItem["_type"]),
    listItem$endorsed, listItem$anonymous, listItem$author_id,
    unlist(listItem$comment_thread_id), listItem$author_username,
    ## format() keeps the timestamp readable; inside c() a bare POSIXct
    ## value would be reduced to its underlying number of seconds
    format(as.POSIXct(unlist(listItem$created_at)/1000, origin="1970-01-01")))), # end t(), c()
    file="temp.csv", sep="\t", append=TRUE, row.names=FALSE, col.names=FALSE)
}
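An alternative to appending one line per record is to collect the records in a data frame and write it once, which also gives the output a header row. The sketch below is one possible way to do that in base R, not the method used above; the two records, the to_row helper and the chosen columns are hypothetical and only mimic the shape of the parsed list.

```r
# Hypothetical records shaped like the parsed list (most fields left out)
records <- list(
  list(body = "first\n", author_id = "1", votes = list(up_count = 0),
       created_at = list(`$date` = 1389487660636)),
  list(body = "second", author_id = "2", votes = list(up_count = 2),
       parent_id = list(`$oid` = "a1"),
       created_at = list(`$date` = 1389405803386))
)

# Turn one record into a one-row data frame (to_row is a made-up helper name)
to_row <- function(rec) {
  # records without a parent get NA so that every row has the same columns
  parent <- if (length(rec$parent_id) > 0) unlist(rec$parent_id)[[1]] else NA_character_
  data.frame(
    up_count   = rec$votes$up_count,
    parent_id  = parent,
    body       = gsub("\n", " ", rec$body, fixed = TRUE),
    author_id  = rec$author_id,
    created_at = as.POSIXct(unlist(rec$created_at)[[1]] / 1000,
                            origin = "1970-01-01", tz = "UTC"),
    stringsAsFactors = FALSE
  )
}

# Stack the rows and write the table in one go
comments <- do.call(rbind, lapply(records, to_row))
out <- tempfile(fileext = ".csv")
write.csv(comments, out, row.names = FALSE)
```

With the real data, records would be the list returned by fromJSON, and the data frame can be analysed directly in R without going through a file.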

How do I open a .mongo file and export its contents to csv?
