Question

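I have an ndjson data source. As a simple example, consider a text file with three lines, each containing a valid JSON message. I want to extract 7 variables from the messages and put them in a data frame.

Please use the following sample data in a text file. You can paste it into a text editor and save it as "ndjson_sample.txt".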

{"ts":"1","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-70,\"Var4\":12353,\"Var5\":1,\"Var6\":\"abc\",\"Var7\":\"x\"}"}
{"ts":"2","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-68,\"Var4\":4528,\"Var5\":1,\"Var6\":\"def\",\"Var7\":\"y\"}"}
{"ts":"3","ct":"{\"Var1\":6,\"Var2\":6,\"Var3\":-70,\"Var4\":-5409,\"Var5\":1,\"Var6\":\"ghi\",\"Var7\":\"z\"}"}

The following three lines of code do what I want:

file1 <- "ndjson_sample.txt"
json_data1 <- ndjson::stream_in(file1)
raw_df_temp1 <- as.data.frame(ndjson::flatten(json_data1$ct))

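For reasons I can't go into, I cannot use the ndjson package. I have to find a way to do the same thing with the jsonlite package, using its stream_in() and stream_out() functions. This is what I tried: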

con_in1 <- file(file1, open = "rt")
con_out1 <- file(tmp <- tempfile(), open = "wt")
callback_func <- function(df){
  jsonlite::stream_out(df, con_out1, pagesize = 1)
}
jsonlite::stream_in(con_in1, handler = callback_func, pagesize = 1)
close(con_out1)
con_in2 <- file(tmp, open = "rt")
raw_df_temp2 <- jsonlite::stream_in(con_in2)

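This does not give me the same data frame as the final output. Can you tell me what I am doing wrong, and what I need to change to make raw_df_temp1 equal to raw_df_temp2?

I could solve this by running fromJSON() on every line of the file, but I would like to find a way to use the stream functions. The files I will be processing are very large, so efficiency is key. I need this to be as fast as possible.

Thanks in advance.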

Answer 1

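Currently, under ct you will find a string. That string can (subsequently) be fed to fromJSON on its own, but it is not parsed as part of the outer record. Ignoring your stream_out(stream_in(...),...) test, here are a couple of ways to read it: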

library(jsonlite)
json <- stream_in(file('ds_guy.ndjson'), simplifyDataFrame=FALSE)
# opening file input connection.
#  Imported 3 records. Simplifying...
# closing file input connection.
cbind(
  ts = sapply(json, `[[`, "ts"),
  do.call(rbind.data.frame, lapply(json, function(a) fromJSON(a$ct)))
)
#   ts Var1 Var2 Var3  Var4 Var5 Var6 Var7
# 1  1    6    6  -70 12353    1  abc    x
# 2  2    6    6  -68  4528    1  def    y
# 3  3    6    6  -70 -5409    1  ghi    z
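
As a small illustration of why the second parsing pass is needed, the sketch below parses one record twice. The record is a shortened, hypothetical version of the sample lines (only Var1 and Var6 are kept), not data from the question.

```r
library(jsonlite)

# Shortened, hypothetical record in the same shape as the sample data.
line <- '{"ts":"1","ct":"{\\"Var1\\":6,\\"Var6\\":\\"abc\\"}"}'

# First pass: the outer object is parsed, but ct stays a JSON-encoded string.
outer <- fromJSON(line)
is.character(outer$ct)

# Second pass: parsing the ct string yields the actual variables.
inner <- fromJSON(outer$ct)
inner$Var1
inner$Var6
```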

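Calling fromJSON on each string can be cumbersome, and with larger data that slowdown is exactly why stream_in exists. So if we can capture the "ct" component into a stream of its own, then ...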

writeLines(sapply(json, `[[`, "ct"), 'ds_guy2.ndjson')

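(There are far more efficient ways to do this with non-R tools, including a simple

sed -e 's/.*"ct":"\({.*}\)"}$/\1/g' -e 's/\\"/"/g' ds_guy.ndjson > ds_guy2.ndjson

though this makes some assumptions about the data that may not be entirely safe. A better solution is to use jq, which should "always" parse valid JSON correctly. With -r (raw output) it prints the ct string already unquoted and unescaped, so no follow-up sed for the escaped quotes is needed:

jq -r '.ct' ds_guy.ndjson > ds_guy2.ndjson

You can run this from R with system(...) if needed.)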

From there, assuming each line contains exactly one row of data.frame data:

json2 <- stream_in(file('ds_guy2.ndjson'), simplifyDataFrame=TRUE)
# opening file input connection.
#  Imported 3 records. Simplifying...
# closing file input connection.
cbind(ts=sapply(json, `[[`, "ts"), json2)
#   ts Var1 Var2 Var3  Var4 Var5 Var6 Var7
# 1  1    6    6  -70 12353    1  abc    x
# 2  2    6    6  -68  4528    1  def    y
# 3  3    6    6  -70 -5409    1  ghi    z

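Note: in the first example, "ts" is a factor and everything else is character, because that is what fromJSON gives. In the second example, all strings are factors. (This reflects R versions before 4.0.0, where stringsAsFactors defaulted to TRUE.) Depending on your needs, this is easily fixed through judicious use of stringsAsFactors=FALSE.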
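
Both examples above read the whole file before the ct strings are parsed. For files too large for that, one possible variant is to parse ct inside a stream_in handler, one page at a time. This is a minimal sketch and not part of the approaches above: the names parse_page, pages, and path are made up for illustration, the sample file is recreated in a temporary location, and it assumes every ct string holds exactly one flat JSON object.

```r
library(jsonlite)

# Recreate the three-line sample from the question in a temporary file.
sample_lines <- c(
  '{"ts":"1","ct":"{\\"Var1\\":6,\\"Var2\\":6,\\"Var3\\":-70,\\"Var4\\":12353,\\"Var5\\":1,\\"Var6\\":\\"abc\\",\\"Var7\\":\\"x\\"}"}',
  '{"ts":"2","ct":"{\\"Var1\\":6,\\"Var2\\":6,\\"Var3\\":-68,\\"Var4\\":4528,\\"Var5\\":1,\\"Var6\\":\\"def\\",\\"Var7\\":\\"y\\"}"}',
  '{"ts":"3","ct":"{\\"Var1\\":6,\\"Var2\\":6,\\"Var3\\":-70,\\"Var4\\":-5409,\\"Var5\\":1,\\"Var6\\":\\"ghi\\",\\"Var7\\":\\"z\\"}"}'
)
path <- tempfile(fileext = ".ndjson")
writeLines(sample_lines, path)

pages <- list()  # collects one parsed data frame per page

# Hypothetical handler: df is one page of outer records (columns ts and ct).
parse_page <- function(df) {
  tc <- textConnection(df$ct)   # treat the ct strings as their own ndjson stream
  on.exit(close(tc))
  inner <- stream_in(tc, verbose = FALSE)
  pages[[length(pages) + 1L]] <<- data.frame(ts = df$ts, inner,
                                             stringsAsFactors = FALSE)
}

# pagesize is tiny here only so the three-line sample spans more than one page.
stream_in(file(path), handler = parse_page, pagesize = 2, verbose = FALSE)
result <- do.call(rbind, pages)
result
```

On real data a much larger pagesize is likely a better fit. If the combined result would not fit in memory either, the handler could write each parsed page out with stream_out instead of collecting it in a list.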

Need to use jsonlite's stream_in() and stream_out() to process a list of ndjson messages
