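I have a JSON file (exported from MongoDB) that I want to load into R. The file is about 890 MB, with roughly 63,000 rows and 12 fields. The fields are numeric, character, and date. I'd like to end up with a 63000 x 12 data frame.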
lines <- readLines("fb2013.json")
Result: lines contains all 63,000 elements as character strings, with all of each record's fields lumped together into a single string.
Each element looks something like this:
"{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }"
Using rjson:
jFile <- fromJSON(paste(readLines("fb2013.json"), collapse=""))
Only the first line is read into jFile, although it does have the 12 fields.
Using RJSONIO:
jFile <- fromJSON(lines)
gives the following:
Warning messages:
1: In if (is.na(encoding)) return(0L) :
the condition has length > 1 and only the first element will be used
Again, only the first line is read into jFile, and it has 12 fields.
The output from rjson and RJSONIO looks something like this:
$`_id`
[1] "1018535"
$comments_count
[1] 0
$created_at
       $date
1.357027e+12
$icon
[1] "http://blah.gif"
$likes_count
[1] 20
$link
[1] "http://www.chachacha"
$message
[1] "I'd love to figure this out."
$page_category
[1] "Internet/software"
$page_id
[1] "3924395872345878534"
$page_name
[1] "Not Entirely Hopeless"
$type
[1] "photo"
$updated_at
       $date
1.357027e+12
Answer 0 (score: 8)
Try:
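library(rjson)
path <- "WHERE/YOUR/JSON/IS/SAVED"
# Open the file and read every line; each line holds one JSON document
con <- file(path, "r")
l <- readLines(con, -1L)
close(con)
# Parse each line separately, giving a list with one element per record
json <- lapply(X=l, fromJSON)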
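As a rough illustration of what the per-line parse produces, here is a minimal sketch using two made-up records in the same one-document-per-line layout as the question. The names sample_lines, likes and created_ms are hypothetical and only serve the example; the real file would be read with readLines as above.

```r
library(rjson)

# Two hypothetical records, one JSON document per line (assumed layout)
sample_lines <- c(
  '{"_id": "1", "likes_count": 450, "created_at": {"$date": 1357941938000}}',
  '{"_id": "2", "likes_count": 20, "created_at": {"$date": 1358210153000}}'
)

# Parse each line on its own, as in the answer above
json <- lapply(sample_lines, fromJSON)

length(json)      # one list element per input line
names(json[[1]])  # field names of the first record

# Pull a single field out of every record as an atomic vector
likes <- vapply(json, function(rec) as.numeric(rec$likes_count), numeric(1))

# Nested fields such as the Mongo date need one more level of indexing
created_ms <- vapply(json, function(rec) as.numeric(rec$created_at[["$date"]]),
                     numeric(1))
```

Using vapply rather than sapply makes the call fail loudly if some record lacks the field, which may be preferable on a large export where not every document is guaranteed to have all 12 fields.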
Answer 1 (score: 3)
Since you want a data.frame, try this:
# three copies of your sample...
line.1<- "{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }"
line.2<- "{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }"
line.3<- "{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }"
x <- paste(line.1, line.2, line.3, sep="\n")
lines <- readLines(textConnection(x))
library(rjson)
# this is the important bit
df <- data.frame(do.call(rbind,lapply(lines,fromJSON)))
ncol(df)
# [1] 12
# finally, there's some cleaning up to do...
df$created_at
# [[1]]
# [[1]]$`$date`
# [1] 1.357942e+12
# ...
df$created_at <- as.POSIXlt(unname(unlist(df$created_at)/1000),origin="1970-01-01")
df$created_at
# [1] "2013-01-11 17:05:38 EST" "2013-01-11 17:05:38 EST" "2013-01-11 17:05:38 EST"
df$updated_at <- as.POSIXlt(unname(unlist(df$updated_at)/1000),origin="1970-01-01")
Note that this conversion assumes the dates are stored as milliseconds since the epoch.
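After the date columns are converted, the remaining columns of df are still list columns, because rbind on a list of records yields a matrix of lists. A possible way to finish the clean-up is sketched below, assuming every record has a single value for each field (a missing or NULL field would make unlist return a shorter vector). The sample records and the helper is_plain_list are hypothetical, and as.POSIXct with tz = "UTC" is used here in place of as.POSIXlt so the printed times do not depend on the local time zone.

```r
library(rjson)

# Two hypothetical records in the same one-document-per-line layout
sample_lines <- c(
  '{"_id": "1", "likes_count": 450, "type": "photo", "created_at": {"$date": 1357941938000}}',
  '{"_id": "2", "likes_count": 20, "type": "link", "created_at": {"$date": 1358210153000}}'
)

# Same construction as above: every column starts out as a list
df <- data.frame(do.call(rbind, lapply(sample_lines, fromJSON)))

# Convert the nested date first (milliseconds since the epoch -> seconds)
df$created_at <- as.POSIXct(unname(unlist(df$created_at)) / 1000,
                            origin = "1970-01-01", tz = "UTC")

# Hypothetical helper: TRUE for ordinary list columns, FALSE for date-times
# (POSIXlt objects are lists internally, so they must be excluded explicitly)
is_plain_list <- function(col) is.list(col) && !inherits(col, "POSIXt")

# Collapse every remaining list column into an atomic vector
df[] <- lapply(df, function(col) if (is_plain_list(col)) unlist(col) else col)

# Inspect the resulting column classes
sapply(df, function(col) class(col)[1])
```

Note that data.frame() runs the column names through make.names, so a field such as _id ends up as X_id in the result.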