I have a large data file with 7000 rows (not that large, really!) that looks like this:
# data can be obtained from pastebin
# filename = input.csv
# lots of comments
# wave flux err
0.807172 7.61973e-11 1.18177e-13
0.807375 7.58666e-11 1.18288e-13
0.807577 7.62136e-11 1.18504e-13
0.80778 7.64491e-11 1.19389e-13
0.807982 7.62858e-11 1.18685e-13
0.808185 7.63852e-11 1.19324e-13
0.808387 7.60547e-11 1.18952e-13
0.80859 7.52287e-11 1.18016e-13
0.808792 7.53114e-11 1.18979e-13
0.808995 7.58247e-11 1.20198e-13
# lots of other lines
Link to the input data: http://pastebin.com/KCW9phzX
I want to extract the rows whose wavelength lies between 0.807375 and 0.807982, so that the output looks like this:
#filename = output.csv
0.807375 7.58666e-11 1.18288e-13
0.807577 7.62136e-11 1.18504e-13
0.80778 7.64491e-11 1.19389e-13
0.807982 7.62858e-11 1.18685e-13
Some similar links:
https://stackoverflow.com/questions/8956832/python-out-of-memory-on-large-csv-file-numpy/8964779#=
efficient way to extract few lines of data from a large csv data file in python
What is the most efficient way to match list items to lines in a large file in Python?
Extract specific lines from file and create sections of data in python
how to extract elements from a list in python?
How to use numpy.genfromtxt when first column is string and the remaining columns are numbers?
genfromtxt and numpy
Answer 0 (score: 4)
You can call np.genfromtxt(f, max_rows=chunksize) in a loop to read the file in chunks. This keeps the convenience and speed of NumPy arrays while letting you control how much memory is used by tuning chunksize.
import numpy as np
import warnings

# genfromtxt warns if it encounters an empty file. Silence that warning,
# since the code below handles the end-of-file case itself.
warnings.filterwarnings("ignore", message='genfromtxt', category=UserWarning)

# Read 2 lines at a time
chunksize = 2

with open('data', 'rb') as fin, open('out.csv', 'w+b') as fout:
    while True:
        arr = np.genfromtxt(fin, max_rows=chunksize, usecols=(0, 1, 2),
                            dtype=float)
        if arr.size == 0:
            break
        arr = np.atleast_2d(arr)
        mask = (arr[:, 0] >= 0.807375) & (arr[:, 0] <= 0.807982)
        arr = arr[mask]
        # Uncomment this print to confirm the file is being read in chunks
        # print('{}\n{}'.format(arr, '-' * 80))
        np.savetxt(fout, arr, fmt='%g')
This writes to out.csv:
0.807375 7.58666e-11 1.18288e-13
0.807577 7.62136e-11 1.18504e-13
0.80778 7.64491e-11 1.19389e-13
0.807982 7.62858e-11 1.18685e-13
For a large data file you will of course want to increase chunksize to some integer much larger than 2. In general, you get the best performance by choosing chunksize as large as possible while still operating on arrays that fit in RAM.
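As a rough illustration of that trade-off, chunksize can be derived from a memory budget: three float64 columns cost 24 bytes per row. The 100 MiB budget below is an assumed figure for the sketch, not something stated in the answer.

```python
import numpy as np

# Rough sketch: size chunksize from an assumed memory budget.
ncols = 3
bytes_per_row = ncols * np.dtype(np.float64).itemsize  # 24 bytes per row
budget = 100 * 1024 * 1024                             # 100 MiB (assumed)
chunksize = budget // bytes_per_row
print(chunksize)  # rows per chunk that fit in the budget
```

In practice the true per-row cost is somewhat higher (genfromtxt builds temporary Python objects while parsing), so treat this as an upper bound and round down.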
The chunked code above is meant for large files. For a file with only 7000 rows,
import numpy as np

with open('data', 'rb') as fin, open('out.csv', 'w+b') as fout:
    arr = np.genfromtxt(fin, usecols=(0, 1, 2), dtype=float)
    mask = (arr[:, 0] >= 0.807375) & (arr[:, 0] <= 0.807982)
    arr = arr[mask]
    np.savetxt(fout, arr, fmt='%g')
is enough.
Answer 1 (score: 1)
Try this:
import pandas as pd

df = pd.read_csv('large_data.csv', usecols=(0, 1, 2), skiprows=57,
                 sep=r'\s+', header=None, names=['wave', 'flux', 'err'])
df = df[(df['wave'] >= 0.807375) & (df['wave'] <= 0.807982)]
print(df)
wave flux err
1 0.807375 7.586660e-11 1.182880e-13
2 0.807577 7.621360e-11 1.185040e-13
3 0.807780 7.644910e-11 1.193890e-13
4 0.807982 7.628580e-11 1.186850e-13
Since you have some unwanted text at the top of the file, you can skip it with the 'skiprows' flag on import. Also, pandas is built on top of numpy, so read_csv has a chunksize flag as well.
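A minimal sketch of that chunksize flag: read_csv returns an iterator of DataFrames when chunksize is given, so each piece can be filtered before concatenation. The data is inlined via StringIO here purely for illustration; with a real file you would pass the filename instead.

```python
import io
import pandas as pd

# Sample rows from the question, inlined for a self-contained sketch.
data = io.StringIO("""\
0.807172 7.61973e-11 1.18177e-13
0.807375 7.58666e-11 1.18288e-13
0.807577 7.62136e-11 1.18504e-13
0.80778 7.64491e-11 1.19389e-13
0.807982 7.62858e-11 1.18685e-13
0.808185 7.63852e-11 1.19324e-13
""")

# chunksize=2 yields DataFrames of two rows each; filter each chunk,
# then concatenate only the matching rows.
chunks = pd.read_csv(data, sep=r'\s+', header=None,
                     names=['wave', 'flux', 'err'], chunksize=2)
parts = [c[(c['wave'] >= 0.807375) & (c['wave'] <= 0.807982)]
         for c in chunks]
result = pd.concat(parts)
print(result)
```

This keeps peak memory proportional to the chunk size rather than the file size, the same idea as the genfromtxt loop in Answer 0.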
Answer 2 (score: 0)
After reading the replies from @ubuntu and @Merlin, the following may also be a good solution.
Note: the answer given by @ubuntu works perfectly.
The answer given by @Merlin did not work as originally posted and was incomplete, but it was a good template to start from.
Note: the input file input.csv can be obtained from pastebin:
http://pastebin.com/KCW9phzX
Using numpy:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author : Bhishan Poudel
# Date   : May 23, 2016

# Imports
import numpy as np

# Using numpy
infile = 'input.csv'
outfile = 'output.csv'
lower_value = 0.807375
upper_value = 0.807982

print('Reading file    : {}'.format(infile))
print('Writing to file : {}'.format(outfile))

with open(infile, 'rb') as fin, open(outfile, 'w+b') as fout:
    arr = np.genfromtxt(fin, usecols=(0, 1, 2), dtype=float)
    mask = (arr[:, 0] >= lower_value) & (arr[:, 0] <= upper_value)
    arr = arr[mask]
    np.savetxt(fout, arr, fmt='%g')
Using pandas:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author : Bhishan Poudel
# Date   : May 23, 2016

# Imports
import pandas as pd

# Extract a wavelength range
infile = 'input.csv'
outfile = 'output.csv'
lower_value = 0.807375
upper_value = 0.807982

print('Reading file    : {}'.format(infile))
print('Writing to file : {}'.format(outfile))

df = pd.read_csv(infile, usecols=(0, 1, 2), skiprows=57, sep=r'\s+',
                 header=None, names=['col0', 'col1', 'col2'])
df = df[(df['col0'] >= lower_value) & (df['col0'] <= upper_value)]
df.to_csv(outfile, header=None, index=None, mode='w', sep=' ')