
Time: 2017-02-06 08:28:05

Tags: python performance csv bigdata

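I have 3 huge CSV files containing climate data, each about 5GB. The first cell in each line is the weather station's number (from 0 to about 100,000), and each station has between 1 and 800 lines in each file, not necessarily the same number in all files. For example, Station 11 has 600, 500, and 200 lines in file1, file2, and file3 respectively. I want to read all the lines for each station, perform some operations on them, write the results to another file, then move on to the next station, and so on. The files are too large to load into memory at once, so I tried some solutions that read them with a minimal memory load, such as this post and this post, which include this method: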

with open(...) as f:
    for line in f:
        <do something with line> 


for station in range (100798):
    with open (file1) as f1, open (file2) as f2, open (file3) as f3:
        for line in f1:
            st = line.split(",")[0]
            if st == str(station):  # st is a string, so compare it with str(station)
                <store this line for some analysis>
                break   # break the for loop and go to read the next file
        for line in f2:
            <similar code to f1>
        for line in f3:
            <similar code to f1>
    <do the analysis for this station, then go to the next station>




Notes on the solutions below: As I mentioned in my comments on the answers below, I implemented @DerFaizio's answer, but I found it very slow.

After I tried the generator-based answer submitted by @PM_2Ring, I found it very fast, perhaps because it relies on generators.

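The difference between the two solutions shows in the number of stations processed per minute: 2500 st/min for the generator-based solution versus 45 st/min for the Pandas-based solution. That makes the generator-based solution more than 55 times faster.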
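To put those rates in perspective, here is a small back-of-the-envelope calculation. It assumes the station count of 100798 taken from the `range(100798)` loop in the question and that both rates stay constant over the whole run, which is only an approximation.

```python
TOTAL_STATIONS = 100798  # assumption: taken from range(100798) in the question


def minutes_needed(stations_per_minute):
    """Total run time in minutes at a constant processing rate."""
    return TOTAL_STATIONS / stations_per_minute


generator_minutes = minutes_needed(2500)  # generator-based solution
pandas_minutes = minutes_needed(45)       # Pandas-based solution
speedup = 2500 / 45

print("generator: {:.0f} min".format(generator_minutes))
print("pandas:    {:.1f} h".format(pandas_minutes / 60))
print("speedup:   {:.1f}x".format(speedup))
```

Under these assumptions the generator-based run takes roughly 40 minutes, while the Pandas-based run takes roughly 37 hours.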

I will keep both implementations below for reference. Many thanks to all contributors, especially @PM_2Ring.

4 Answers:

Answer 0 (score: 2)




from random import seed, randrange


station_hi = 5
def create_data():
    ''' Fill 3 files with fake station data '''
    fbase = 'datafile_'
    for fnum in range(1, 4):
        with open(fbase + str(fnum), 'w') as f:
            for snum in range(station_hi):
                for i in range(randrange(1, 4)):
                    s = '{1} data{0}{1}{2}'.format(fnum, snum, i)
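                    f.write(s + '\n')

# Run once to generate the test files. The data is random, so it will
# differ from the sample listing below.
create_data()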


# A file buffer that you can push lines back to
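def file_buff(fh):
    prev = None
    while True:
        while prev:
            yield prev
            prev = yield prev
        try:
            line = next(fh)
        except StopIteration:
            # End of file. Since Python 3.7 (PEP 479) a StopIteration must
            # not escape from a generator body, so finish explicitly.
            return
        prev = yield line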

# An infinite counter that yields numbers converted to strings
def str_count(start=0):
    n = start
    while True: 
        yield str(n)
        n += 1

# Extract station data from all 3 files
with open('datafile_1') as f1, open('datafile_2') as f2, open('datafile_3') as f3:
    fb1, fb2, fb3 = file_buff(f1), file_buff(f2), file_buff(f3)

    for snum_str in str_count():
        station_lines = []
        for fb in (fb1, fb2, fb3):
            for line in fb:
                #Extract station number string & station data
                sid, sdata = line.split()
                if sid != snum_str:
                    # This line contains data for the next station,
                    # so push it back to the buffer
                    rc = fb.send(line)
                    # and go to the next file
                    break
                # Otherwise, append this data
                station_lines.append(sdata)

        #Process all the data lines for this station
        if not station_lines:
            #There's no more data to process
            break
        print('Station', snum_str)
        print(station_lines)


Sample contents of datafile_1:
0 data100
1 data110
1 data111
2 data120
3 data130
3 data131
4 data140
4 data141

Sample contents of datafile_2:
0 data200
1 data210
2 data220
2 data221
3 data230
3 data231
3 data232
4 data240
4 data241
4 data242

Sample contents of datafile_3:
0 data300
0 data301
1 data310
1 data311
2 data320
3 data330
4 data340

Output:
Station 0
['data100', 'data200', 'data300', 'data301']
Station 1
['data110', 'data111', 'data210', 'data310', 'data311']
Station 2
['data120', 'data220', 'data221', 'data320']
Station 3
['data130', 'data131', 'data230', 'data231', 'data232', 'data330']
Station 4
['data140', 'data141', 'data240', 'data241', 'data242', 'data340']


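For details on generators and the generator.send method, see 6.2.9. Yield expressions in the docs.

This code was developed with Python 3, but it also runs on Python 2.6+ (you just need to include from __future__ import print_function at the top of the script).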
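To make the push-back mechanism easier to follow, here is a stripped-down sketch of how `send()` interacts with a generator. It is not the `file_buff` function from the answer: `pushback_lines` is a hypothetical, simplified variant, and an in-memory `io.StringIO` stands in for a real file.

```python
import io


def pushback_lines(fh):
    """Iterate over fh, letting the caller push a line back with .send()."""
    for line in fh:
        pushed = yield line
        while pushed is not None:
            # The value yielded here is what the .send() call itself returns
            yield None
            # The next next() call receives the pushed-back line again
            pushed = yield pushed


fb = pushback_lines(io.StringIO("a\nb\nc\n"))
first = next(fb)    # first line of the file
second = next(fb)   # second line
fb.send(second)     # push the second line back
again = next(fb)    # the pushed-back line is delivered a second time
rest = list(fb)     # whatever remains in the file
print(repr(first), repr(second), repr(again), rest)
```

A plain `next()` (or a `for` loop) sends `None` into the generator, which is how the generator tells an ordinary read apart from a push-back.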

If a station ID may be missing from all 3 files, we can easily handle that. Just use a simple range loop instead of the infinite str_count generator.



from random import seed, randrange


station_hi = 7
def create_data():
    ''' Fill 3 files with fake station data '''
    fbase = 'datafile_'
    for fnum in range(1, 4):
        with open(fbase + str(fnum), 'w') as f:
            for snum in range(station_hi):
                for i in range(randrange(0, 2)):
                    s = '{1} data{0}{1}{2}'.format(fnum, snum, i)
                    f.write(s + '\n')

# Generate the test files (the data is random, so each run differs)
create_data()


# A file buffer that you can push lines back to
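def file_buff(fh):
    prev = None
    while True:
        while prev:
            yield prev
            prev = yield prev
        try:
            line = next(fh)
        except StopIteration:
            # End of file. Since Python 3.7 (PEP 479) a StopIteration must
            # not escape from a generator body, so finish explicitly.
            return
        prev = yield line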

station_start = 0
station_stop = station_hi

# Extract station data from all 3 files
with open('datafile_1') as f1, open('datafile_2') as f2, open('datafile_3') as f3:
    fb1, fb2, fb3 = file_buff(f1), file_buff(f2), file_buff(f3)

    for i in range(station_start, station_stop):
        snum_str = str(i)
        station_lines = []
        for fb in (fb1, fb2, fb3):
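            for line in fb:
                #Extract station number string & station data
                sid, sdata = line.split()
                if sid != snum_str:
                    # This line contains data for the next station,
                    # so push it back to the buffer
                    rc = fb.send(line)
                    # and go to the next file
                    break
                # Otherwise, append this data
                station_lines.append(sdata)

        #Process all the data lines for this station
        if not station_lines:
            # No data for this station in any file, so skip it
            continue
        print('Station', snum_str)
        print(station_lines)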
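As an alternative to the push-back buffer, the same station-by-station merge can be sketched with `itertools.groupby`. This is a different technique from the one in the answer above, shown only for comparison. It assumes every file is sorted by ascending numeric station ID in the first comma-separated column; the helper names `station_groups` and `merge_stations` and the in-memory sample data are hypothetical.

```python
import csv
import io
from itertools import groupby


def station_groups(fh):
    """Yield (station_id, rows) for each run of consecutive rows sharing an ID."""
    reader = csv.reader(fh)
    for sid, rows in groupby(reader, key=lambda row: int(row[0])):
        # Only one station is held in memory at a time
        yield sid, list(rows)


def merge_stations(file_handles):
    """Yield (station_id, rows_per_file) in ascending station order."""
    streams = [station_groups(fh) for fh in file_handles]
    heads = [next(s, None) for s in streams]  # pending group of each file
    while any(head is not None for head in heads):
        sid = min(head[0] for head in heads if head is not None)
        rows_per_file = []
        for i, head in enumerate(heads):
            if head is not None and head[0] == sid:
                rows_per_file.append(head[1])
                heads[i] = next(streams[i], None)
            else:
                rows_per_file.append([])  # station absent from this file
        yield sid, rows_per_file


# Tiny in-memory stand-ins for the three big files (hypothetical data)
f1 = io.StringIO("0,a\n1,b\n1,c\n3,d\n")
f2 = io.StringIO("1,e\n2,f\n")
f3 = io.StringIO("0,g\n3,h\n")
merged = list(merge_stations([f1, f2, f3]))
for sid, rows_per_file in merged:
    print(sid, [len(rows) for rows in rows_per_file])
```

Because each stream keeps only its next pending group, a station that is missing from one or two files simply yields an empty list for those files.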

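Answer 1 (score: 0)

I suggest using pandas.read_csv. You can use skiprows to specify the rows to skip and nrows to load a reasonable number of rows, depending on your file size. Here is a link to the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html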
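The skip-then-limit idea can be illustrated without pandas. The sketch below is a standard-library analogue built on `csv` and `itertools.islice`, not the pandas call itself; `read_window` is a hypothetical helper and the sample data is made up.

```python
import csv
import io
from itertools import islice


def read_window(fh, skip, nrows):
    """Return up to `nrows` parsed rows after skipping the first `skip` lines."""
    fh.seek(0)  # always start from the top, like a fresh read of the file
    reader = csv.reader(fh)
    return list(islice(reader, skip, skip + nrows))


sample = io.StringIO("id,value\n0,a\n0,b\n1,c\n1,d\n2,e\n")
window = read_window(sample, skip=3, nrows=2)  # skips the header and station 0
tail = read_window(sample, skip=5, nrows=10)   # fewer rows left than requested
print(window)
print(tail)
```

Note that in this sketch the skipped lines still have to be read and discarded on every call, so each window costs more the deeper it sits in the file.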

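Answer 2 (score: 0)

I posted the code below before @PM-2Ring posted his solution. I would like to keep both solutions available:

Solution #1, which relies on the Pandas library (suggested by @DerFaizio):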


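import pandas as pd
# full_paths is the list of the three CSV paths (Tmax, Tmin, Tavg),
# built as in the generator-based solution further down
skips = [1, 1, 1]  # to skip the header row forever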
for station_number in range(100798):
    storage = {}
    tmax = pd.read_csv(full_paths[0], skiprows=skips[0], header=None, nrows=126000, usecols=[0, 1, 3])
    tmin = pd.read_csv(full_paths[1], skiprows=skips[1], header=None, nrows=126000, usecols=[0, 1, 3])
    tavg = pd.read_csv(full_paths[2], skiprows=skips[2], header=None, nrows=126000, usecols=[0, 1, 3])

    # tmax is at position 0
    for idx, station in enumerate(tmax[0]):
        if station == station_number:
            date_val = tmax[1][idx]
            t_val = float(tmax[3][idx])/10.
            storage[date_val] = [t_val, None, None]
            skips[0] += 1
    # tmin is at position 1
    for idx, station in enumerate(tmin[0]):
        # station, date_val, _, val = lne.split(",")
        if station == station_number:
            date_val = tmin[1][idx]
            t_val = float(tmin[3][idx]) / 10.
            if date_val in storage:
                storage[date_val][1] = t_val
            else:
                storage[date_val] = [None, t_val, None]
            skips[1] += 1
    # tavg is at position 2
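    for idx, station in enumerate(tavg[0]):
        # same pattern as Tmin, using slot 2
        if station == station_number:
            date_val = tavg[1][idx]
            t_val = float(tavg[3][idx]) / 10.
            if date_val in storage:
                storage[date_val][2] = t_val
            else:
                storage[date_val] = [None, None, t_val]
            skips[2] += 1

    station_info = []
    for key in storage.keys():
        # do some analysis
        # Fill the list station_info
        pass  # placeholder for the analysis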

The following is the generator-based solution (by @PM-2Ring):


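import csv
import os
import time

# NOTE: source, get__proper_data_type and from_excel_ordinal are defined
# elsewhere in my script and are not shown here.

files = ['Tmax', 'Tmin', 'Tavg']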
headers = ['Nesr_Id', 'r_Year', 'r_Month', 'r_Day', 'Tmax', 'Tmin', 'Tavg']

# A file buffer that you can push lines back to
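def file_buff(fh):
    prev = None
    while True:
        while prev:
            yield prev
            prev = yield prev
        try:
            line = next(fh)
        except StopIteration:
            # End of file. Since Python 3.7 (PEP 479) a StopIteration must
            # not escape from a generator body, so finish explicitly.
            return
        prev = yield line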

# An infinite counter that yields numbers converted to strings
def str_count(start=0):
    n = start
    while True:
        yield str(n)
        n += 1

# NULL = -999.99
print("Time started: {}".format(time.strftime('%Y-%m-%d %H:%M:%S')))
# Python 3: open the CSV output in text mode with newline=''
with open('Results\\GHCN_Daily\\Important\\Temp_All_out_gen.csv', 'w', newline='') as out_file:
    data_out = csv.writer(out_file, quoting=csv.QUOTE_NONE, quotechar=None, delimiter=',', escapechar='\\')
    data_out.writerow(headers)
    full_paths = [os.path.join(source, '{}.csv'.format(file_name)) for file_name in files]
    # Extract station data from all 3 files
    with open(full_paths[0]) as f1, open(full_paths[1]) as f2, open(full_paths[2]) as f3:
        fb1, fb2, fb3 = file_buff(f1), file_buff(f2), file_buff(f3)

        for snum_str in str_count():
            if int(snum_str) > 100797:
                # str_count() never stops by itself, so end after the last station
                break
            storage = {}
            count = [0, 0, 0]
            for file_id, fb in enumerate((fb1, fb2, fb3)):
                for line in fb:
                    # Only process lines whose first field is not a plain string
                    # (presumably this skips the header row)
                    if not isinstance(get__proper_data_type(line.split(",")[0]), str):
                        # Extract station number string & station data
                        sid, date_val, _dummy, sdata = line.split(",")
                        if sid != snum_str:
                            # This line contains data for the next station,
                            # so push it back to the buffer
                            rc = fb.send(line)
                            # and go to the next file
                            break
                        # Otherwise, store this value in the slot of its file
                        sdata = float(sdata) / 10.
                        count[file_id] += 1
                        storage.setdefault(date_val, [None, None, None])[file_id] = sdata

            print("St# {:6d}/100797. Time: {}. Tx({}), Tn({}), Ta({}) ".format(
                int(snum_str), time.strftime('%H:%M:%S'), count[0], count[1], count[2]))

            station_info = []
            for key, (tx, tn, ta) in storage.items():
                # Derive a missing value from the other two: Tavg = (Tmax + Tmin) / 2
                if ta is None:
                    if tx is not None and tn is not None:
                        ta = round((tx + tn) / 2., 1)
                if tx is None:
                    if tn is not None and ta is not None:
                        tx = round(2. * ta - tn, 1)
                if tn is None:
                    if tx is not None and ta is not None:
                        tn = round(2. * ta - tx, 1)
                py_date = from_excel_ordinal(int(key))
                station_info.append([snum_str, py_date.year, py_date.month, py_date.day, tx, tn, ta])

            # Write this station's rows before moving on to the next station
            data_out.writerows(station_info)
            del station_info
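The gap-filling rule inside the loop above rests on the relation Tavg = (Tmax + Tmin) / 2, which lets any one missing value be derived from the other two. As a minimal sketch, the rule can be isolated in a small function; `fill_missing` is a hypothetical name that does not appear in the original script.

```python
def fill_missing(tx, tn, ta):
    """Derive one missing temperature from the other two via ta = (tx + tn) / 2."""
    if ta is None and tx is not None and tn is not None:
        ta = round((tx + tn) / 2.0, 1)
    elif tx is None and tn is not None and ta is not None:
        tx = round(2.0 * ta - tn, 1)
    elif tn is None and tx is not None and ta is not None:
        tn = round(2.0 * ta - tx, 1)
    # With two or more values missing, nothing can be derived
    return tx, tn, ta


print(fill_missing(30.0, 20.0, None))
print(fill_missing(None, 20.0, 25.0))
print(fill_missing(None, None, 25.0))
```

Keeping the rule in one place makes it straightforward to check in isolation before running it over all stations.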


Answer 3 (score: -1)


// Mini-config
$make_attribute_code = 'make';
$model_attribute_code = 'model';
$filter_by_make = 'Manufacturer A';
$filter_by_model = 'Type A';
$load_from_category = 10117;

// Parse attribute option id of "Make"
$make_option_id = null;
$makeAttributeId = Mage::getResourceModel('eav/entity_attribute')->getIdByCode('catalog_product', $make_attribute_code);
$attribute = Mage::getModel('catalog/resource_eav_attribute')->load($makeAttributeId);
$attributeOptions = $attribute->getSource()->getAllOptions();
foreach ($attributeOptions as $attributeOption) {
    if (trim($attributeOption['label']) == $filter_by_make) {
        $make_option_id = (int)trim($attributeOption['value']);
    }
}

// Parse attribute option id of "Model"
$model_option_id = null;
$modelAttributeId = Mage::getResourceModel('eav/entity_attribute')->getIdByCode('catalog_product', $model_attribute_code);
$attribute = Mage::getModel('catalog/resource_eav_attribute')->load($modelAttributeId);
$attributeOptions = $attribute->getSource()->getAllOptions();
foreach ($attributeOptions as $attributeOption) {
    if (trim($attributeOption['label']) == $filter_by_model) {
        $model_option_id = (int)trim($attributeOption['value']);
    }
}

// Load category products having selected "Make" & "Model"
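// Assumption: the category is loaded from $load_from_category and its
// product collection is then filtered
$products = Mage::getModel('catalog/category')
    ->load($load_from_category)
    ->getProductCollection()
    ->addAttributeToFilter(array(array('attribute' => $make_attribute_code, 'eq' => $make_option_id)))
    ->addAttributeToFilter(array(array('attribute' => $model_attribute_code, 'eq' => $model_option_id)));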
