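EDIT 2014-05-01: I first tried starting from JSON (as suggested below), but only the first line was parsed. I found out that there should be a comma between the brackets of each JSON record, so I changed that in TextEdit and saved the file. I also added [ at the beginning of the file and ] at the end, and then it worked as JSON. Now on to the next step: going from a list (with embedded lists) to a data frame (or csv).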
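The manual fix described in the edit above can also be scripted. The following is only a minimal sketch in base R: `fix_mongo_dump` and the temporary file names are hypothetical, and it assumes that every record boundary in the dump is a line containing nothing but `}{`, as in the sample further down.

```r
# Hypothetical helper: turns a dump of back-to-back JSON objects into a JSON array.
fix_mongo_dump <- function(infile, outfile) {
  lines <- readLines(infile, warn = FALSE)
  # Assumption: a record boundary is a line that contains only "}{"
  boundary <- trimws(lines) == "}{"
  lines[boundary] <- "},{"
  # Wrap everything in [ ... ] so the file is a single valid JSON array
  writeLines(c("[", lines, "]"), outfile)
  # Number of records, assuming the input held at least one
  invisible(sum(boundary) + 1L)
}

# Small demonstration on a throwaway file
demo_in <- tempfile(fileext = ".mongo")
demo_out <- tempfile(fileext = ".json")
writeLines(c("{", "\"body\": \"a\"", "}{", "\"body\": \"b\"", "}"), demo_in)
n_records <- fix_mongo_dump(demo_in, demo_out)
```

Matching whole lines rather than replacing every `}{` in the text avoids touching a `}{` that happens to appear inside a comment body.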
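Every now and then I get a data package from edX for the courses we are evaluating. Some of the files are plain .csv files that are easy to handle; others are harder for me (I have no CS or programming background).
I have 2 files that I need to open and parse into csv files so I can analyse them in R. I have tried a lot of json2csv tools, to no avail. I also tried the simple method for converting json to csv described here.
The data is confidential, so I cannot share the whole data set, but I will share the first two lines of the file, which may help. The problem is that I cannot find anything about .mongo files, which seems odd to me. Do they even exist? Or is this just a JSON file that may be corrupted (which would explain the errors)?
Any suggestions are welcome.
The first two lines of one of the .mongo files: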
{
"_id": {
"$oid": "52d1e62c350e7a3156000009"
},
"votes": {
"up": [
],
"down": [
],
"up_count": 0,
"down_count": 0,
"count": 0,
"point": 0
},
"visible": true,
"abuse_flaggers": [
],
"historical_abuse_flaggers": [
],
"parent_ids": [
],
"at_position_list": [
],
"body": "the delft university accredited course with the scholarship (fundamentals of water treatment) is supposed to start in about a month's time. But have the scholarship list been published? Any tentative date??",
"course_id": "DelftX/CTB3365x/2013_Fall",
"_type": "Comment",
"endorsed": false,
"anonymous": false,
"anonymous_to_peers": false,
"author_id": "269835",
"comment_thread_id": {
"$oid": "52cd40c5ab40cf347e00008d"
},
"author_username": "tachak59",
"sk": "52d1e62c350e7a3156000009",
"updated_at": {
"$date": 1389487660636
},
"created_at": {
"$date": 1389487660636
}
}{
"_id": {
"$oid": "52d0a66bcb3eee318d000012"
},
"votes": {
"up": [
],
"down": [
],
"up_count": 0,
"down_count": 0,
"count": 0,
"point": 0
},
"visible": true,
"abuse_flaggers": [
],
"historical_abuse_flaggers": [
],
"parent_ids": [
{
"$oid": "52c63278100c07c0d1000028"
}
],
"at_position_list": [
],
"body": "I got it. Thank you!",
"course_id": "DelftX/CTB3365x/2013_Fall",
"_type": "Comment",
"endorsed": false,
"anonymous": false,
"anonymous_to_peers": false,
"parent_id": {
"$oid": "52c63278100c07c0d1000028"
},
"author_id": "2655027",
"comment_thread_id": {
"$oid": "52c4f303b03c4aba51000013"
},
"author_username": "dmoronta",
"sk": "52c63278100c07c0d1000028-52d0a66bcb3eee318d000012",
"updated_at": {
"$date": 1389405803386
},
"created_at": {
"$date": 1389405803386
}
}{
"_id": {
"$oid": "52ceea0cada002b72c000059"
},
"votes": {
"up": [
],
"down": [
],
"up_count": 0,
"down_count": 0,
"count": 0,
"point": 0
},
"visible": true,
"abuse_flaggers": [
],
"historical_abuse_flaggers": [
],
"parent_ids": [
{
"$oid": "5287e8d5906c42f5aa000013"
}
],
"at_position_list": [
],
"body": "if u please send by mail \n",
"course_id": "DelftX/CTB3365x/2013_Fall",
"_type": "Comment",
"endorsed": false,
"anonymous": false,
"anonymous_to_peers": false,
"parent_id": {
"$oid": "5287e8d5906c42f5aa000013"
},
"author_id": "2276302",
"comment_thread_id": {
"$oid": "528674d784179607d0000011"
},
"author_username": "totah1993",
"sk": "5287e8d5906c42f5aa000013-52ceea0cada002b72c000059",
"updated_at": {
"$date": 1389292044203
},
"created_at": {
"$date": 1389292044203
}
}
Answer 0 (score: 3)
R has no "native" support for these files, but the rjson package provides a JSON parser. So I would load my .mongo file like this:
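library(rjson)
myfile <- "path/to/myfile.mongo"
# readLines() returns one string per line; fromJSON() expects a single string
myJSON <- paste(readLines(myfile), collapse = "\n")
myNiceData <- fromJSON(myJSON)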
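Because rjson converts the input into whatever R data structure fits the object being read, you will have to do a bit of extra poking around to see what you got, but once you have an R data type you should have no trouble working with it from there.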
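As a rough illustration of that poking around, here is a small sketch that parses a trimmed-down copy of one record from the question and inspects the result. It assumes the rjson package is installed; the variable names are made up for the example.

```r
library(rjson)

# A trimmed-down copy of one record from the question
one_record <- '{
  "_id": {"$oid": "52d0a66bcb3eee318d000012"},
  "votes": {"up": [], "up_count": 0},
  "body": "I got it. Thank you!"
}'

parsed <- fromJSON(one_record)

# str() shows how the JSON objects were mapped onto nested R lists
str(parsed, max.level = 2)

# Nested values are reached by chaining list accessors;
# names such as "_id" and "$oid" need [[ ]] (or backticks) rather than plain $
up_count <- parsed$votes$up_count
record_id <- parsed[["_id"]][["$oid"]]
```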
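Another package to consider when parsing JSON data is jsonlite. It will create data frames for you, so you can write them out in csv format with write.table or some other method suited to writing objects.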
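A sketch of what the jsonlite route might look like, assuming the package is installed and the file has already been turned into a valid JSON array. The two records are trimmed-down versions of the ones in the question, and dropping the list columns before writing is just one possible choice, not something jsonlite requires.

```r
library(jsonlite)

# Two trimmed-down records from the question, wrapped in a JSON array
json_text <- '[
  {"_id": {"$oid": "52d1e62c350e7a3156000009"},
   "votes": {"up_count": 0, "down_count": 0},
   "parent_ids": [],
   "author_id": "269835",
   "created_at": {"$date": 1389487660636}},
  {"_id": {"$oid": "52d0a66bcb3eee318d000012"},
   "votes": {"up_count": 0, "down_count": 0},
   "parent_ids": [{"$oid": "52c63278100c07c0d1000028"}],
   "author_id": "2655027",
   "created_at": {"$date": 1389405803386}}
]'

# flatten = TRUE spreads nested objects into columns such as votes.up_count
comments <- fromJSON(json_text, flatten = TRUE)

# Array-valued fields (here parent_ids) stay as list columns, which
# write.csv() cannot handle, so this sketch simply leaves them out
is_list_col <- vapply(comments, is.list, logical(1))
flat <- comments[, !is_list_col, drop = FALSE]

out_file <- tempfile(fileext = ".csv")
write.csv(flat, out_file, row.names = FALSE)
```

If the array-valued fields matter for the analysis, they would need to be collapsed into strings (or unnested into their own table) instead of being dropped.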
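Note: if it is easier to connect to MongoDB and fetch the data through a query, RMongo might be a good option. R-Bloggers also has a post on using RMongo that includes a nice little walkthrough.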
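Answer 1 (score: 0)
I used rjson as @theWanderer suggested and, with the help of a colleague, wrote the following code. It parses the data into columns, selects the specific columns that are needed, and checks for each instance whether the right variable is returned.
The whole workflow:
Code to import the .mongo file into R and then parse it to CSV: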
library(rjson)

###### set working directory to write out the data file
setwd("/your/favourite/dir/json to csv/")

# never ever convert strings to factors
options(stringsAsFactors = FALSE)

# import the .mongo file to R
temp.data <- fromJSON(file = "temp.mongo", method = "C", unexpected.escape = "error")

# remove the old data file if there is one, so the rows below go into
# a new file instead of being appended to old data
if (file.exists("temp.csv")) file.remove("temp.csv")

for (listItem in temp.data) {
  # top-level comments have no parent_id, so fall back to an empty string
  parent_id <- ""
  if (length(listItem$parent_id) > 0) {
    # parent_id is a list holding "$oid"; unlist() keeps c() below from returning a list
    parent_id <- unlist(listItem$parent_id)
  }
  # write one tab-separated row per comment
  write.table(t(c(
      listItem$votes$up_count, listItem$visible, parent_id,
      gsub("\n", "", listItem$body), listItem$course_id, unlist(listItem["_type"]),
      listItem$endorsed, listItem$anonymous, listItem$author_id,
      unlist(listItem$comment_thread_id), listItem$author_username,
      # format the timestamp as text; inside c() it would otherwise end up as a bare number
      format(as.POSIXct(unlist(listItem$created_at) / 1000, origin = "1970-01-01")))),
    file = "temp.csv", sep = "\t", append = TRUE, row.names = FALSE, col.names = FALSE)
}
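The `$date` fields in the dump hold milliseconds since 1970-01-01, which is why the code above divides by 1000 before calling as.POSIXct. A small base-R sketch of that conversion on its own, using the three `created_at` values from the question; the choice of UTC and the variable names are assumptions made for the example.

```r
# created_at values from the three sample records, in milliseconds since the epoch
created_ms <- c(1389487660636, 1389405803386, 1389292044203)

# as.POSIXct() expects seconds, so divide by 1000 first;
# tz = "UTC" is an assumption here, pick whatever zone suits the analysis
created_at <- as.POSIXct(created_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Turn the date-times into text before combining them with other strings,
# because c() on mixed values would keep only the underlying number
created_txt <- format(created_at, "%Y-%m-%d %H:%M:%S")
created_txt[1]  # "2014-01-12 00:47:40"
```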