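In my case, I have two csv files (file1 and file2).
To simplify my problem, let's say I want to read the rows of file1 three at a time and the rows of file2 four at a time, alternating between the two files.
file1.csv (9 rows)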
1,2,3
3,5,8
7,2,9
10,111,12
13,14,155
31,2,3
3,15,82
8,4,91
12,111,13
file2.csv (12 rows)
55,12,17
3,6,13
72,1,91
10,0,12
1,1,73
31,2,3
3,15,61
18,6,91
13,33,13
7,1,15
9,17,42
41,8,18
The output I would like to get is:
1,2,3 (from row 1 of file1.csv)
3,5,8 (from row 2 of file1.csv)
7,2,9 (from row 3 of file1.csv)
55,12,17 (from row 1 of file2.csv)
3,6,13 (from row 2 of file2.csv)
72,1,91 (from row 3 of file2.csv)
10,0,12 (from row 4 of file2.csv)
10,111,12 (from row 4 of file1.csv)
13,14,155 (from row 5 of file1.csv)
31,2,3 (from row 6 of file1.csv)
1,1,73 (from row 5 of file2.csv)
31,2,3 (from row 6 of file2.csv)
3,15,61 (from row 7 of file2.csv)
18,6,91 (from row 8 of file2.csv)
3,15,82 (from row 7 of file1.csv)
8,4,91 (from row 8 of file1.csv)
12,111,13 (from row 9 of file1.csv)
13,33,13 (from row 9 of file2.csv)
7,1,15 (from row 10 of file2.csv)
9,17,42 (from row 11 of file2.csv)
41,8,18 (from row 12 of file2.csv)
My real data files are very large (about 1.6 GB each), and I want to use as little memory as possible. To that end, I wrote this script:
f1, f2, = open(pathInput1, 'r'), open(pathInput2, 'r')
position1, position2 = 0, 0
for i in range(6):
    if i%2 == 0:
        #print("file1.csv")
        sizeOfWindow = 3
        sizeOfWindowInactive = 4
        f1.seek(position1)
        data = []
        for l in range(sizeOfWindow):
            line = f1.readline()
            line = list(map(int, line[:-1].split(",")))
            data.append(line)
        data = np.array(data)
        print(data)
        [next(f2) for i in range(sizeOfWindowInactive)]
        position1 = f1.tell()
    else:
        #print("file2.csv")
        sizeOfWindow = 4
        sizeOfWindowInactive = 3
        f2.seek(position2)
        data = []
        for l in range(sizeOfWindow):
            line = f2.readline()
            line = list(map(int, line[:-1].split(",")))
            data.append(line)
        data = np.array(data)
        print(data)
        [next(f1) for i in range(sizeOfWindowInactive)]
        position2 = f2.tell()
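After writing this script, I noticed that I cannot use readline() and next() together. So my question is: how can I restructure my script to get the same output without using much memory?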
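Edit: In my real case I have 5 files, and each file has its own sizeOfWindow, so sizeOfWindow is fixed per file. I do not read the files in a regular order: an if statement on the last chunk of data I read decides which file to jump to next. While I am reading one file, I need to move the cursors of the other files forward without reading their data.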
Answer 0 (score: 0)
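Since you only need to read the files sequentially, you can use next(f1) and next(f2) to fetch the lines you need. The itertools module has helpers that make this easier. itertools.islice takes a given number of lines from a file, so you don't need your own next loop. itertools.cycle steps through the items of a list over and over, so you don't need to keep track of which file comes next. Putting it together: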
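import itertools
import numpy as np

with open(pathInput1) as f1, open(pathInput2) as f2:
    # (window size, file) pairs, visited in round-robin order
    grab_this = ((3, f1), (4, f2))
    for num, fp in itertools.cycle(grab_this):
        # islice pulls at most num lines; parse each one into a list of ints
        rows = [list(map(int, line.split(","))) for line in itertools.islice(fp, num)]
        if not rows:  # the current file is exhausted
            break
        data = np.array(rows)
        print(data)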
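
For the case described in the question's edit (several files, each with a fixed window size, where the data just read decides which file comes next), a minimal sketch could look like the following. The names read_window, skip_lines, interleave and round_robin are hypothetical helpers, not part of any library, and round_robin only stands in for the real data-dependent rule. Rows are kept as plain lists of ints; np.array(window) would convert one if an array is needed.

```python
import io
import itertools
from collections import deque


def read_window(fp, size):
    """Read up to `size` CSV lines from fp; return them as lists of ints."""
    return [list(map(int, line.split(","))) for line in itertools.islice(fp, size)]


def skip_lines(fp, count):
    """Advance fp by up to `count` lines without storing any of them."""
    # A zero-length deque consumes the iterator and discards every item.
    deque(itertools.islice(fp, count), maxlen=0)


def round_robin(current, window, n_files):
    """Placeholder rule (assumption): always move on to the next file.

    A real rule would inspect `window` to decide where to jump.
    """
    return (current + 1) % n_files


def interleave(files, windows, choose_next, start=0):
    """Yield (file index, window) pairs until the selected file is exhausted."""
    current = start
    while True:
        window = read_window(files[current], windows[current])
        if not window:
            return
        yield current, window
        current = choose_next(current, window, len(files))


# Small in-memory stand-ins for the real files.
demo1 = io.StringIO("1,2,3\n3,5,8\n7,2,9\n10,111,12\n13,14,155\n31,2,3\n")
demo2 = io.StringIO("55,12,17\n3,6,13\n72,1,91\n10,0,12\n")
for index, window in interleave([demo1, demo2], [3, 4], round_robin):
    print(index, window)
```

Only one window is held in memory at a time, and every file is read strictly forward, so no seek() or tell() calls are needed. If the cursor of an inactive file also has to move, skip_lines(other_file, n) can be called inside the loop; it still has to scan the skipped lines to find the line breaks, but it does not parse or keep them.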