I am trying to import a bz2-compressed JSON file and convert it into a data frame (link to the file and a dput example below). I've had some success with these lines of code:
library(jsonlite)
# readLines() decompresses the .bz2 file transparently; each line holds one JSON object
out <- lapply(readLines("RC_2005-12.bz2"), fromJSON)
df <- data.frame(matrix(unlist(out), nrow = length(out), byrow = TRUE))
out is a nested list of named entries. However, the named entries are not in a consistent order, so each column of df ends up mixing values from different entries.
Using the dput example below: controversiality is the first entry of the first sublist, while created_utc is the first entry of the second sublist. As a result, the first column of df looks like:
X1
0
1134365725
It should, of course, be a column of two zeros, one for the controversiality of each sublist. How can I order/sort/normalize the sublists so that the columns line up? Alternatively, how can I match on the names when converting the list to a data frame?
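One untested sketch of what I have in mind, assuming every record carries the same set of field names: replace NULL fields with NA (so unlist() does not silently drop them and shift the columns), then put every sublist's fields in the same order before flattening:

# Untested sketch: align every sublist to one field order before flattening
out_fixed <- lapply(out, function(x) {
  x[sapply(x, is.null)] <- NA   # keep empty fields as NA placeholders
  x[sort(names(x))]             # same alphabetical field order everywhere
})
df <- data.frame(matrix(unlist(out_fixed), nrow = length(out_fixed), byrow = TRUE))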
The full data file RC_2005-12.bz2 is available at http://files.pushshift.io/reddit/comments/
Below are the first two sublists of out:
list(structure(list(controversiality = 0, body = "A look at Vietnam and Mexico exposes the myth of market liberalisation.",
subreddit_id = "t5_6", link_id = "t3_17863", stickied = FALSE,
subreddit = "reddit.com", score = 2, ups = 2, author_flair_css_class = NULL,
created_utc = 1134365188, author_flair_text = NULL, author = "frjo",
id = "c13", edited = FALSE, parent_id = "t3_17863", gilded = 0,
distinguished = NULL, retrieved_on = 1473738411), .Names = c("controversiality", "body", "subreddit_id", "link_id", "stickied", "subreddit", "score", "ups", "author_flair_css_class", "created_utc", "author_flair_text", "author", "id", "edited", "parent_id", "gilded", "distinguished", "retrieved_on")), structure(list(created_utc = 1134365725, author_flair_css_class = NULL, score = 1, ups = 1, subreddit = "reddit.com", stickied = FALSE, link_id = "t3_17866", subreddit_id = "t5_6", controversiality = 0, body = "The site states \"What can I use it for? Meeting notes, Reports, technical specs Sign-up sheets, proposals and much more...\", just like any other new breeed of sites that want us to store everything we have on the web. And they even guarantee multiple levels of security and encryption etc. But what prevents these web site operators fom accessing and/or stealing Meeting notes, Reports, technical specs Sign-up sheets, proposals and much more, for competitive or personal gains...? I am pretty sure that most of them are honest, but what's there to prevent me from setting up a good useful site and stealing all your data? Call me paranoid - I am.",
retrieved_on = 1473738411, distinguished = NULL, gilded = 0,
id = "c14", edited = FALSE, parent_id = "t3_17866", author = "zse7zse",
author_flair_text = NULL), .Names = c("created_utc", "author_flair_css_class", "score", "ups", "subreddit", "stickied", "link_id", "subreddit_id", "controversiality", "body", "retrieved_on", "distinguished", "gilded", "id", "edited", "parent_id", "author", "author_flair_text")))
Answer 0 (score: 3)
The read_ndjson function from the corpus package does not care about the order in which the fields appear:
data <- corpus::read_ndjson(bzfile("RC_2005-12.bz2"))
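A quick sanity check against the two records shown in the question, both of which have controversiality 0 even though the field sits in a different position in each raw JSON record:

data$controversiality[1:2]
#> [1] 0 0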
An unrelated problem that needs fixing:
It looks like whoever produced this file made a mistake: it is encoded in UTF-8, but they treated it as Latin-1. See, for example, record 8:
data$body[8]
#> [1] "I donâ\u0080\u0099t know where they came up with this stuff, but Qube Web Search Client has taken the market by surprise. This is a cool concept thatâ\u0080\u0099s just beginning to blossom. You can save time by copying and pasting words and phrases."
Fix it by first undoing the conversion from what they thought was Latin-1 to UTF-8:
body <- iconv(data$body, "UTF-8", "Latin1")
Then declare the correct encoding:
Encoding(body) <- "UTF-8"
Check the result:
body[8]
#> [1] "I don’t know where they came up with this stuff, but Qube Web Search Client has taken the market by surprise. This is a cool concept that’s just beginning to blossom. You can save time by copying and pasting words and phrases."
Make sure the values are valid UTF-8:
all(utf8::utf8_valid(body))
#> [1] TRUE
Update the data:
data$body <- body
Other fields in your data may need the same fix.
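For instance, a hedged sketch that applies the same two-step fix to every character column, assuming each affected column round-trips as cleanly as body did:

# Re-encode every character column the same way
for (name in names(data)) {
  if (is.character(data[[name]])) {
    fixed <- iconv(data[[name]], "UTF-8", "Latin1")  # undo the mistaken conversion
    Encoding(fixed) <- "UTF-8"                       # declare the true encoding
    data[[name]] <- fixed
  }
}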
Answer 1 (score: 0)
Your file appears to contain one JSON object per line. We can tweak the JSON slightly to produce a single JSON array and let jsonlite::fromJSON do the dirty work. Something like:
require(jsonlite)
# wrap the one-object-per-line file in brackets, comma-separated, to form one JSON array
lines <- paste0("[", paste(readLines("RC_2005-12.bz2"), collapse = ","), "]")
fromJSON(lines)
# 'data.frame': 1075 obs. of 18 variables:
# $ controversiality : int 0 0 0 0 0 0 0 0 0 0 ...
#...
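As an aside (not part of the original answer), jsonlite also has a stream_in() function built for exactly this one-object-per-line (NDJSON) layout, which avoids building the large concatenated string in memory; an untested sketch:

# stream the NDJSON records straight into a data frame
df <- jsonlite::stream_in(bzfile("RC_2005-12.bz2"))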