Python - Writing a CSV file into JSON files of x rows each

Date: 2016-08-16 16:05:23

Tags: python json csv

I have a CSV file that I need to write out as JSON files of 1,000 rows each. The CSV has about 9,000 rows, so ideally I'd end up with 9 separate JSON files of consecutive data.

I know how to write a single CSV file to JSON; here's what I've been doing:

import csv
import json

csvfile = open("C:\\Users\\Me\\Desktop\\data\\data.csv", 'r', encoding="utf8")
reader = csv.DictReader(csvfile, delimiter=",")
out = json.dumps([row for row in reader])

with open("C:\\Users\\Me\\Desktop\\data\\data.json", 'w') as f:
    f.write(out)

That works fine. But I need the output split into 9 separate JSON files. Right now, I figure I could either:

1) try to keep a count and stop once I reach 1,000, or

2) write the CSV to a single JSON file, then open that JSON file and somehow split it up.

I'm pretty lost on how to achieve this; any help is appreciated!

4 Answers:

Answer 0 (score: 2):

Read the whole CSV file into a list of rows, then write slices of length 1,000 out to JSON files.

import csv
import json

input_file = 'C:\\Users\\Me\\Desktop\\data\\data.csv'
output_file_template = 'C:\\Users\\Me\\Desktop\\data\\data_{}.json'

with open(input_file, 'r', encoding='utf8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=',')
    rows = list(reader)

for i in range(len(rows) // 1000):
    out = json.dumps(rows[1000*i:1000*(i+1)])
    with open(output_file_template.format(i), 'w') as f:
        f.write(out)
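One caveat: len(rows) // 1000 silently drops any final partial group when the row count is not an exact multiple of 1,000. A minimal sketch of a chunk writer that keeps the remainder (the write_chunks helper and its file-name template are illustrative, not part of the answer above):

```python
import json

def write_chunks(rows, chunk_size, template):
    """Write rows to numbered JSON files, chunk_size rows per file.
    The last file keeps any leftover rows, so it may be shorter."""
    for file_idx, start in enumerate(range(0, len(rows), chunk_size)):
        with open(template.format(file_idx), 'w') as f:
            json.dump(rows[start:start + chunk_size], f)
    # Ceiling division: number of files written
    return (len(rows) + chunk_size - 1) // chunk_size
```

With 9,000 rows this behaves exactly like the code above (9 files); with 9,500 rows it would write a tenth file holding the last 500 rows.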

Answer 1 (score: 2):

Instead of reading the whole CSV file at once, you can iterate over it, which reduces memory usage.

For example, here is a simple row iteration:

with open(input_file, 'r', encoding='utf8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=',')
    for row in reader:
        print(row)

During iteration, you can enumerate the rows and use that index to compute groups of 1,000 rows:

group_size = 1000

with open(input_file, 'r', encoding='utf8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=',')
    for index, row in enumerate(reader):
        group_idx = index // group_size
        print(group_idx, row)

You should get something like this:

0 [row 0...]
0 [row 1...]
0 [row 2...]
...
0 [row 999...]
1 [row 1000...]
1 [row 1001...]
etc.

You can use itertools.groupby to group your rows by 1,000.

Building on Alberto Garcia-Raboso's solution, you can use:

from __future__ import division

import csv
import json
import itertools

input_file = 'C:\\Users\\Me\\Desktop\\data\\data.csv'
output_file_template = 'C:\\Users\\Me\\Desktop\\data\\data_{}.json'

group_size = 1000

with open(input_file, 'r', encoding='utf8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=',')
    for key, group in itertools.groupby(enumerate(reader),
                                        key=lambda item: item[0] // group_size):
        grp_rows = [item[1] for item in group]
        content = json.dumps(grp_rows)
        with open(output_file_template.format(key), 'w') as jsonfile:
            jsonfile.write(content)

For example, with some fake data:

from __future__ import division
import itertools

rows = [[1, 2], [3, 4], [5, 6], [7, 8],
        [1, 2], [3, 4], [5, 6], [7, 8],
        [1, 2], [3, 4], [5, 6], [7, 8],
        [1, 2], [3, 4], [5, 6], [7, 8],
        [1, 2], [3, 4], [5, 6], [7, 8]]

group_size = 4
for key, group in itertools.groupby(enumerate(rows),
                                    key=lambda item: item[0] // group_size):
    g_rows = [item[1] for item in group]
    print(key, g_rows)

You will get:

0 [[1, 2], [3, 4], [5, 6], [7, 8]]
1 [[1, 2], [3, 4], [5, 6], [7, 8]]
2 [[1, 2], [3, 4], [5, 6], [7, 8]]
3 [[1, 2], [3, 4], [5, 6], [7, 8]]
4 [[1, 2], [3, 4], [5, 6], [7, 8]]

Answer 2 (score: 0):

There is no reason to use DictReader here; a regular csv.reader will do fine. You can also use itertools.islice on the reader object to slice the data into groups of n rows, dumping each set into a new file:

from itertools import islice, count
import csv
import json

with open("C:\\Users\\Me\\Desktop\\data\\data.csv") as f:
    reader, cnt = csv.reader(f), count(1)
    for rows in iter(lambda: list(islice(reader, 1000)), []):
        with open("C:\\Users\\Me\\Desktop\\data\\data{}.json".format(next(cnt)), 'w') as out:
            json.dump(rows, out)
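The iter(callable, sentinel) pattern above is worth seeing in isolation: it keeps calling list(islice(reader, 1000)) until the reader is exhausted and an empty list comes back. A small self-contained demo of the same chunking on an in-memory reader (the data and chunk size here are made up for illustration):

```python
from itertools import islice
import csv
import io

# Fake CSV data in memory: a header plus 10 data rows
data = "col\n" + "\n".join(str(i) for i in range(10))
reader = csv.reader(io.StringIO(data))
next(reader)  # skip the header row

# Pull rows in chunks of 4 until an empty chunk signals exhaustion
chunks = list(iter(lambda: list(islice(reader, 4)), []))
print([len(c) for c in chunks])  # → [4, 4, 2]
```

Note that the final chunk simply comes out shorter, so a row count that is not a multiple of the chunk size is handled for free.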

Answer 3 (score: -1):

This reads the file data.csv once and creates separate JSON files named data_1.json through data_9.json, since there are 9,000 rows.

As long as the number of rows in data.csv is a multiple of 1,000, it will create number_of_rows/1000 files without any changes to the code.

import csv
import json

csvfile = open("C:\\Users\\Me\\Desktop\\data\\data.csv", 'r', encoding="utf8")

reader = csv.DictReader(csvfile, delimiter=",")

r = []
counter = 0
fileid = 1

for row in reader:
    r.append(row)
    counter += 1
    if counter == 1000:
        out = json.dumps(r)
        fname = "C:\\Users\\Me\\Desktop\\data\\data_" + str(fileid) + ".json"
        with open(fname, 'w') as f:
            f.write(out)

        # resetting & updating variables
        fileid += 1
        counter = 0
        r = []
        out = None
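One limitation of this counter-based loop, as the answer notes, is that rows left in r after the loop ends (when the total is not a multiple of 1,000) are never written. A minimal sketch of the same batching as a generator with a final flush (the batch_rows helper is illustrative, not from the answer):

```python
def batch_rows(rows, batch_size):
    """Yield (file_id, batch) pairs of up to batch_size rows each,
    including a final partial batch if one remains."""
    batch, file_id = [], 1
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield file_id, batch
            file_id += 1
            batch = []
    if batch:  # flush the leftover rows the counter-based loop would drop
        yield file_id, batch

# With 2,500 input rows and batches of 1,000:
print([(fid, len(b)) for fid, b in batch_rows(range(2500), 1000)])
# → [(1, 1000), (2, 1000), (3, 500)]
```

Each yielded batch can then be passed to json.dumps and written out exactly as in the loop above.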