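Hi all,
I have a flow that flattens a .json file, using logic similar to this one. My real data is 2.5 GB, so after it had run for 3 hours I cancelled it and decided to filter the input instead, since I only need a portion of it (~5%). Let's say I want to keep all the original records where name = NWest.
When I do a simple readLines, R gives me a structure that is new to me:
Named Char[1:88888]
I tried to assign a name to it, but without success. How can I achieve this?
I'm fairly new to R and JSON, so I appreciate any leads. I feel R should have something for this; how can I pull in only the records matching something like '%NWest%'?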
fileName <- "test.json"
con <- file(fileName, open = "r")
line <- readLines(con)   # character vector, one element per line
close(con)
names(line)
str(line)
names(line) <- "colx"    # my attempt to give it a column name
I learned that subset in R needs to reference a column, but how would that work when there are no columns? Here is my input:
{"batch_date": "2015-05", "name": "Jeff Macronsh", "cust_cid": "001555", "clients": ["111112222", "1324132531", "1235325", "1324324321"], "fans": 2319, "rewards": 3.75, "type": "dealer", "bonuses": {"suka": 13, "plain": 4, "writer": 1, "maxima": 1, "more": 1, "prima": 5}, "lexus": []}
{"batch_date": "2014-07", "name": "NWest", "cust_cid": "332224", "clients": ["093485734250"], "fans": 1, "rewards": 4.5, "type": "dealer", "bonuses": {"note": 12, "suv": 10, "prima": 1}, "lexus": []}
{"batch_date": "2014-11", "name": "Muhhamed Karne", "cust_cid": "234566000", "bonuses": {"profile": 5, "suv": 52, "cute": 1, "plain": 43, "bbb": 35, "note": 33, "photos": 3, "maxima": 56, "more": 12, "prima": 151}, "lexus": [2013, 2014]}
{"batch_date": "2013-11", "name": "West", "cust_cid": "4567465800", "bonuses": {"plain": 1, "maxima": 1, "more": 2, "photos": 1, "suv": 1}, "lexus": []}
{"batch_date": "2014-02", "name": "Jake", "cust_cid": "6467889000", "bonuses": {"cute": 1, "suv": 30, "plain": 43, "writer": 38, "note": 16, "photos": 2, "maxima": 33, "prima": 39, "more": 5}, "lexus": [2012, 2014, 2015]}
{"batch_date": "2014-11", "name": "Michelle Mow", "cust_cid": "345653477", "bonuses": {"maxima": 1, "write": 15, "platinum": 33}, "lexus": []}
{"batch_date": "2015-07", "name": "NWest", "cust_cid": "332224", "clients": ["093485734250", "4313124324"], "bonuses": {"note": 12, "suv": 90, "prima": 1}, "lexus": []}
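Answer 0: (score: 1)
Here is one way to select a subset of lines from a large file. It reads the input one chunk at a time (20 lines in this example, but use a larger chunk for your file), because I don't know how much memory your system has. Matching lines are written to a temporary file, which then contains just the subset you can process.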
> # subset a large file by reading a small number of lines, finding a
> # match and then writing to a new file. this assumes that each line
> # is a complete json set.
> n <- 20 # number of lines -- make large for your 2.5GB file
> output_file <- tempfile() # output file
> output <- file(output_file, 'wt') # text file on output
> input_file <- '/temp/json.txt' # my input file of your data (2400 lines)
> input <- file(input_file, 'rt')
> repeat{
+ lines <- readLines(input, n = n)
+ if (length(lines) == 0L) break # exit if done
+ # find matching lines
+ mch <- grep('"name": "NWest"', lines) # see if any matches
+ if (length(mch) == 0L) next # no match, read next set
+ writeLines(lines[mch], output) # write out lines that match
+ }
> close(input)
> close(output)
> # show sizes of the files to show that output is a subset of input
> file.info(input_file)
size isdir mode mtime ctime atime exe
/temp/json.txt 503360 FALSE 666 2015-10-11 18:00:02 2015-10-11 18:00:02 2015-10-11 18:00:02 no
> file.info(output_file)
size isdir mode mtime
C:\\Users\\jh52822\\AppData\\Local\\Temp\\Rtmpuc6YDS\\file25b81f2f1700 131648 FALSE 666 2015-10-11 18:15:16
ctime atime
C:\\Users\\jh52822\\AppData\\Local\\Temp\\Rtmpuc6YDS\\file25b81f2f1700 2015-10-11 18:15:15 2015-10-11 18:15:15
exe
C:\\Users\\jh52822\\AppData\\Local\\Temp\\Rtmpuc6YDS\\file25b81f2f1700 no
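On the question of subsetting without columns: the result of `readLines` is a plain character vector, so `subset` on a column does not apply, but the vector can be indexed directly with a logical vector from `grepl`. The following is a minimal sketch, assuming one complete JSON object per line as in the sample input. The three records and the temporary file are made up for illustration, and parsing with `jsonlite` is an assumption (it is skipped if the package is not installed).

```r
# Illustrative sample: three one-line JSON records written to a temp file.
sample_file <- tempfile(fileext = ".json")
writeLines(c(
  '{"batch_date": "2015-05", "name": "Jeff Macronsh", "cust_cid": "001555"}',
  '{"batch_date": "2014-07", "name": "NWest", "cust_cid": "332224"}',
  '{"batch_date": "2013-11", "name": "West", "cust_cid": "4567465800"}'
), sample_file)

lines <- readLines(sample_file)  # character vector, one element per line

# Logical index: TRUE where the line contains the literal text.
# fixed = TRUE treats the pattern as plain text rather than a regex.
keep <- grepl('"name": "NWest"', lines, fixed = TRUE)
nwest_lines <- lines[keep]

# Optional: parse the kept lines into R lists (assumes jsonlite is installed).
if (requireNamespace("jsonlite", quietly = TRUE)) {
  records <- lapply(nwest_lines, jsonlite::fromJSON)
  print(records[[1]]$cust_cid)
}
```

Note that a text match like this depends on the exact spacing and key order in the file; if the spacing around the colon varies, a regular expression or a JSON-aware filter would be more robust.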
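Answer 1: (score: 0)
You might want to try jqr (https://github.com/ropensci/jqr), an R client for jq (https://stedolan.github.io/jq/). It is also worth trying jq in a shell outside of R. We haven't tested jqr on very large data, so hopefully it works. It should, since it calls the jq C code. If the file is too big to read into R, using the shell would be better.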
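As a rough sketch of the shell route, the jq command-line tool can be called from R with `system2` so that the large file never has to be loaded into R. This assumes jq is installed and on the PATH and that R is running on a Unix-like system (quoting differs on Windows). The sample file and variable names are hypothetical; for real data, `input_file` would point at the 2.5 GB file.

```r
# Illustrative input: two one-line JSON records in a temp file.
input_file <- tempfile(fileext = ".json")
writeLines(c(
  '{"batch_date": "2014-07", "name": "NWest", "cust_cid": "332224"}',
  '{"batch_date": "2013-11", "name": "West", "cust_cid": "4567465800"}'
), input_file)
output_file <- tempfile(fileext = ".json")

# Only run if the jq executable can be found (assumption: jq is installed).
if (nzchar(Sys.which("jq"))) {
  # -c prints each selected object compactly on a single line;
  # select() keeps only objects whose name field equals "NWest".
  status <- system2(
    "jq",
    args = c("-c", shQuote('select(.name == "NWest")'), shQuote(input_file)),
    stdout = output_file
  )
  nwest_lines <- readLines(output_file)
}
```

Unlike a plain text match, this compares the parsed `name` field, so it does not depend on spacing or key order in the input.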