I have to process a large number of .bz2 files that are bigger than my RAM, so I cannot read a whole file into memory at once.
In RStudio I use the following code, which lets me process the data in batches.
# libraries
library(dplyr)
library(jsonlite)

# connection to the output file
con_out <- file("test3.json", open = "w")

# open the input connection
o <- 1  # file number
file <- bzfile(description = paste("D:/data/", o, ".bz2", sep = ""),
               open = "r", encoding = getOption("encoding"),
               compression = 9)

# stream the data in page by page, filter it, and stream it back out
stream_in(con = file, handler = function(df) {
  df <- dplyr::filter(df, ups > 1)
  stream_out(df, con = con_out)
}, pagesize = 10000, verbose = TRUE)

close(file)
close(con_out)
Now I am wondering whether the same thing can be done in Python. I wrote the following code, which processes the files line by line, but it is very slow.
import os
import bz2
import ujson

directory = os.fsencode("D:\\data")
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith("bz2"):
        filename = "D:\\data\\" + filename
        filename_json = filename[:-3] + "json"
        with bz2.open(filename, "rt") as bzinput:
            with open(filename_json, "x") as jsonOutput:
                for i, line in enumerate(bzinput):
                    line_json = ujson.loads(line)
                    # apply the same filter as the R handler (ups > 1)
                    # and write the record back out as one JSON line
                    if line_json.get("ups", 0) > 1:
                        jsonOutput.write(ujson.dumps(line_json) + "\n")
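For comparison, here is a minimal sketch of a closer Python analogue of jsonlite::stream_in, assuming the input is newline-delimited JSON with an ups field, as in the R code above. pandas.read_json can read a bz2 file in chunks (chunksize plays the role of pagesize), so each page is parsed in bulk instead of calling ujson.loads once per line. The paths and chunk size below are placeholders.

import pandas as pd

# read the compressed NDJSON file in pages of 10,000 records
reader = pd.read_json("D:\\data\\1.bz2", lines=True,
                      chunksize=10000, compression="bz2")

with open("D:\\data\\1.json", "w") as out:  # hypothetical output path
    for chunk in reader:
        filtered = chunk[chunk["ups"] > 1]  # like dplyr::filter(df, ups > 1)
        if not filtered.empty:
            # append each filtered page as newline-delimited JSON
            out.write(filtered.to_json(orient="records", lines=True) + "\n")

Whether this is faster in practice depends on how much of the work moves into pandas' parser, but parsing a whole page at once is usually much quicker than a pure Python per-line loop.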