Question

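All, I have a process to flatten a .json file, using logic similar to this logic. My real data is 2.5 GB, so after 3 hours of running I decided to cancel it and filter the input instead, because I only need part of it (~5%). Let's say I want to bring in all the raw rows where name = NWest.

When I do a simple readLines, R gives me a structure that is new to me (> Named chr [1:88888]). I tried to assign names to it, without success. How can I achieve this? I'm a bit new to R/JSON, so thanks for any leads. I feel R should have something for this: how can I bring in rows with something like '%NWest%'?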

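fileName <- "test.json"
con <- file(fileName, open = "r")
line <- readLines(con)   # character vector: one element per line of the file
close(con)
names(line)              # NULL: the vector has no names yet
str(line)
names(line) <- "colx"    # names only the first element; the rest get NA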

This is my input. I learned that subsetting in R should reference columns, but how does it work when there are no columns?

{"batch_date": "2015-05",  "name": "Jeff Macronsh", "cust_cid": "001555", "clients": ["111112222", "1324132531", "1235325", "1324324321"], "fans": 2319, "rewards": 3.75, "type": "dealer", "bonuses": {"suka": 13, "plain": 4, "writer": 1, "maxima": 1, "more": 1, "prima": 5}, "lexus": []}
{"batch_date": "2014-07",  "name": "NWest", "cust_cid": "332224", "clients": ["093485734250"], "fans": 1, "rewards": 4.5, "type": "dealer", "bonuses": {"note": 12, "suv": 10, "prima": 1}, "lexus": []}
{"batch_date": "2014-11",  "name": "Muhhamed Karne", "cust_cid": "234566000",  "bonuses": {"profile": 5, "suv": 52, "cute": 1, "plain": 43, "bbb": 35, "note": 33, "photos": 3, "maxima": 56, "more": 12, "prima": 151}, "lexus": [2013, 2014]}
{"batch_date": "2013-11",  "name": "West", "cust_cid": "4567465800",  "bonuses": {"plain": 1, "maxima": 1, "more": 2, "photos": 1, "suv": 1}, "lexus": []}
{"batch_date": "2014-02",  "name": "Jake", "cust_cid": "6467889000",  "bonuses": {"cute": 1, "suv": 30, "plain": 43, "writer": 38, "note": 16, "photos": 2, "maxima": 33, "prima": 39, "more": 5}, "lexus": [2012, 2014, 2015]}
{"batch_date": "2014-11",  "name": "Michelle Mow", "cust_cid": "345653477",  "bonuses": {"maxima": 1, "write": 15, "platinum": 33}, "lexus": []}
{"batch_date": "2015-07",  "name": "NWest", "cust_cid": "332224", "clients": ["093485734250", "4313124324"],  "bonuses": {"note": 12, "suv": 90, "prima": 1}, "lexus": []}

Answer 1

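Here is a way to select a subset of lines from a large file. It reads the input one portion at a time (20 lines in this example, but make that larger for your file), since I don't know how much memory your system has. It creates a temporary file containing the subset, which you can then process.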

> # subset a large file by reading a small number of lines, finding a 
> # match and then writing to a new file.  this assumes that each line
> # is a complete json set.
> n <- 20  # number of lines -- make large for your 2.5GB file
> output_file <- tempfile()  # output file
> output <- file(output_file, 'wt')  # text file on output
> input_file <- '/temp/json.txt'  # my input file of your data (2400 lines)
> input <- file(input_file, 'rt')
> repeat{
+     lines <- readLines(input, n = n)
+     if (length(lines) == 0L) break  # exit if done
+     # find matching lines
+     mch <- grep('"name": "NWest"', lines)  # see if any matches
+     if (length(mch) == 0L) next  # no match, read next set
+     writeLines(lines[mch], output)  # write out lines that match
+ }
> close(input)
> close(output)
> # show sizes of the files to show that output is a subset of input
> file.info(input_file)
                 size isdir mode               mtime               ctime               atime exe
/temp/json.txt 503360 FALSE  666 2015-10-11 18:00:02 2015-10-11 18:00:02 2015-10-11 18:00:02  no
> file.info(output_file)
                                                                         size isdir mode               mtime
C:\\Users\\jh52822\\AppData\\Local\\Temp\\Rtmpuc6YDS\\file25b81f2f1700 131648 FALSE  666 2015-10-11 18:15:16
                                                                                     ctime               atime
C:\\Users\\jh52822\\AppData\\Local\\Temp\\Rtmpuc6YDS\\file25b81f2f1700 2015-10-11 18:15:15 2015-10-11 18:15:15
                                                                       exe
C:\\Users\\jh52822\\AppData\\Local\\Temp\\Rtmpuc6YDS\\file25b81f2f1700  no
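
The same loop can be wrapped in a function, which also addresses the "no columns" part of the question: a character vector is subset with a logical index, not by column. This is only a sketch in base R. The name filter_lines and the three sample records are made up for illustration, and fixed = TRUE is a swapped-in choice so that the pattern is matched as a literal string rather than as a regular expression.

```r
# Hypothetical helper (not part of the answer above): copy the lines of
# `input_file` that contain `pattern` to `output_file`, reading `n` lines
# at a time so the whole file never has to fit in memory.
filter_lines <- function(input_file, output_file, pattern, n = 10000L) {
  input <- file(input_file, "rt")
  on.exit(close(input), add = TRUE)
  output <- file(output_file, "wt")
  on.exit(close(output), add = TRUE)
  kept <- 0L
  repeat {
    lines <- readLines(input, n = n)
    if (length(lines) == 0L) break        # end of file
    # logical vector, TRUE where the line contains the literal pattern
    hit <- grepl(pattern, lines, fixed = TRUE)
    if (any(hit)) {
      writeLines(lines[hit], output)      # subset by logical index
      kept <- kept + sum(hit)
    }
  }
  kept                                    # number of lines written
}

# Small demonstration on three made-up records
demo_in <- tempfile(fileext = ".json")
writeLines(c(
  '{"batch_date": "2015-05", "name": "Jeff Macronsh", "cust_cid": "001555"}',
  '{"batch_date": "2014-07", "name": "NWest", "cust_cid": "332224"}',
  '{"batch_date": "2013-11", "name": "West", "cust_cid": "4567465800"}'
), demo_in)
demo_out <- tempfile(fileext = ".json")
n_kept <- filter_lines(demo_in, demo_out, '"name": "NWest"', n = 2L)
subset_lines <- readLines(demo_out)
```

Note that the match depends on the exact spacing in the file ("name": "NWest" with one space after the colon), so check a few raw lines first. The filtered file is small enough to hand to whatever JSON parser you were using for the flattening step.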

Answer 2

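You may want to try jqr (https://github.com/ropensci/jqr), an R client for jq (https://stedolan.github.io/jq/). It is also worth trying jq in the shell, outside of R. We haven't tested jqr on very large data, so hopefully it works. It should, since it calls the jq C code. If the file is too large to read into R, using the shell will be the better option.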
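
As a rough illustration of the shell route, the sketch below calls the jq command-line tool from R with system2, so the 2.5 GB file is never loaded into R. It assumes jq is installed and on the PATH, and that the file has one JSON object per line as in the question. The name run_jq_filter is hypothetical, and this uses the jq CLI directly rather than the jqr package.

```r
# Hypothetical helper: write every record whose "name" equals `name`
# from `input_file` to `output_file`, using the external jq program.
# Assumption: jq is installed and can be found on the PATH.
run_jq_filter <- function(input_file, output_file, name = "NWest") {
  if (!nzchar(Sys.which("jq"))) stop("jq was not found on the PATH")
  args <- c(
    "-c",                                 # one compact JSON object per output line
    "--arg", "n", shQuote(name),          # pass the name in as jq variable $n
    shQuote("select(.name == $n)"),       # keep only the matching records
    shQuote(input_file)
  )
  # stdout = <file name> sends jq's output straight to that file
  status <- system2("jq", args, stdout = output_file)
  invisible(status)                       # exit status of jq (0 on success)
}
```

With the file from the question this would be called as run_jq_filter("test.json", "nwest.json"). Unlike a plain text match, jq parses each record, so the result does not depend on the spacing around the colon.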

How to subset a .json file in R (like a flat file)
