Question

I am using the following code to read a very large JSON file (90 GB):

library(jsonlite)
library(dplyr)
con_in <- file("events.json")
con_out <- file("event-frequencies1.json", open = "wb")
stream_in(con_in, handler = function(df) {
    df <- df[df$`rill.message/cursor` > 23000000, ]
    stream_out(df, con_out)
})
close(con_out)

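My code works, but the problem is that I need data from the middle of the file, and with the code above it takes hours to get there. Is there a way to start reading/processing the file from some offset (say, the middle of the file)? I am thinking of a starting line number or a byte offset.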

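If this cannot be done with stream_in(), what is the best way to process such a large file? I need to select certain rows from this JSON and either put them into a data frame or create a smaller JSON file from the selected rows.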

Answer 1

You should be able to call seek() on the file connection to start reading at any byte you like. For example:

con_in <- file("myfile.json")
open(con_in)
# skip ahead 300 bytes
seek(con_in,300)
# read till end of line so stream_in will start on a fresh new line
throwaway <- readLines(con_in,1) 

stream_in(con_in, handler = function(df) {
    print(df)
})

close(con_in)
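
To start near the middle of the file rather than at a fixed 300 bytes, the same idea can be wrapped in a small helper. This is only a sketch: stream_from_offset is a made-up name, not part of jsonlite, and it assumes the file is newline-delimited JSON (one record per line), which is what stream_in() expects. The connection is opened in binary mode so that seek() counts raw bytes. Note that the first line after the offset is always discarded, so if the offset happens to land exactly on the start of a record, that one record is skipped.

```r
library(jsonlite)

# Hypothetical helper (not a jsonlite function): stream an NDJSON file
# starting at the first complete line after a given byte offset.
stream_from_offset <- function(path, offset, handler, pagesize = 500) {
  con <- file(path, open = "rb")   # binary mode so seek() works on raw bytes
  on.exit(close(con))
  if (offset > 0) {
    seek(con, where = offset, origin = "start")
    readLines(con, n = 1)          # throw away the (probably partial) line
  }
  stream_in(con, handler = handler, pagesize = pagesize, verbose = FALSE)
  invisible(NULL)
}

# Possible usage with the file from the question (assumed to exist):
# mid <- floor(file.size("events.json") / 2)
# stream_from_offset("events.json", mid, function(df) print(nrow(df)))
```

Because the offset is in bytes and not in records, finding the byte position that corresponds to a particular cursor value would still take some trial and error, for example by trying a few offsets and looking at the first record returned.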
