Reading a large CSV file from the nth line in Python (not from the beginning)

Date: 2017-02-06 08:28:05

Tags: python performance csv bigdata

I have 3 huge CSV files containing climate data, each about 5GB. The first cell in every row is the weather station number (from 0 to about 100,000), and each station has between 1 and 800 rows per file, not necessarily equal across the files. For example, Station 11 has 600, 500, and 200 rows in file1, file2, and file3 respectively. I want to read all of each station's rows, perform some operations on them, write the results to another file, and then move on to the next station, and so on. The files are too large to load into memory at once, so I tried some solutions for reading them with minimal memory load, such as this post and this post, which suggest this approach:

with open(...) as f:
    for line in f:
        <do something with line> 

The problem with this approach is that it reads the file from the beginning every time, whereas I want to read the files as follows:

for station in range(100798):
    with open(file1) as f1, open(file2) as f2, open(file3) as f3:
        for line in f1:
            st = line.split(",")[0]
            if st == str(station):  # station ids parse as strings
                <store this line for some analysis>
            else:
                break   # break the for loop and go read the next file
        for line in f2:
            ...
            <similar code to f1>
            ...
        for line in f3:
            ...
            <similar code to f1>
            ...
    <do the analysis for this station, then go to the next station>

The problem is that every time I move on to the next station, the for loops start over from the beginning of the files, whereas I want them to resume from the nth line where the 'break' happened, i.e. to continue reading each file from where it left off.

How can I do that?

Thanks in advance.

A note on the solutions below: as I mentioned when commenting on the answers, I implemented @DerFaizio's answer, but I found it very slow to process.

After I tried the generator-based answer submitted by @PM_2Ring, I found it very fast, probably because it relies on generators.

The difference between the two solutions can be seen in the number of stations processed per minute: about 2,500 st/min for the generator-based solution versus about 45 st/min for the Pandas-based solution, which makes the generator-based solution roughly 55× faster.

I will keep both implementations below for reference. Many thanks to all contributors, especially @PM_2Ring.

4 Answers:

Answer 0 (score: 2)

The code below iterates over the files line by line, grabbing each station's lines from each file in turn and appending them to a list for further processing.

The heart of this code is the generator file_buff, which yields the lines of a file but allows us to push a line back for later reading. When we read a line that belongs to the next station, we can send it back to file_buff so that we can re-read it when it's time to process that station's lines.

To test this code, I created some simple fake station data using create_data.

from random import seed, randrange

seed(123)

station_hi = 5
def create_data():
    ''' Fill 3 files with fake station data '''
    fbase = 'datafile_'
    for fnum in range(1, 4):
        with open(fbase + str(fnum), 'w') as f:
            for snum in range(station_hi):
                for i in range(randrange(1, 4)):
                    s = '{1} data{0}{1}{2}'.format(fnum, snum, i)
                    print(s)
                    f.write(s + '\n')
        print()

create_data()

# A file buffer that you can push lines back to
def file_buff(fh):
    prev = None
    while True:
        while prev:
            yield prev
            prev = yield prev
        prev = yield next(fh)

# An infinite counter that yields numbers converted to strings
def str_count(start=0):
    n = start
    while True: 
        yield str(n)
        n += 1

# Extract station data from all 3 files
with open('datafile_1') as f1, open('datafile_2') as f2, open('datafile_3') as f3:
    fb1, fb2, fb3 = file_buff(f1), file_buff(f2), file_buff(f3)

    for snum_str in str_count():
        station_lines = []
        for fb in (fb1, fb2, fb3):
            for line in fb:
                #Extract station number string & station data
                sid, sdata = line.split()
                if sid != snum_str:
                    # This line contains data for the next station,
                    # so push it back to the buffer
                    rc = fb.send(line)
                    # and go to the next file
                    break
                # Otherwise, append this data
                station_lines.append(sdata)

        #Process all the data lines for this station
        if not station_lines:
            #There's no more data to process
            break
        print('Station', snum_str)
        print(station_lines)

Output:

0 data100
1 data110
1 data111
2 data120
3 data130
3 data131
4 data140
4 data141

0 data200
1 data210
2 data220
2 data221
3 data230
3 data231
3 data232
4 data240
4 data241
4 data242

0 data300
0 data301
1 data310
1 data311
2 data320
3 data330
4 data340

Station 0
['data100', 'data200', 'data300', 'data301']
Station 1
['data110', 'data111', 'data210', 'data310', 'data311']
Station 2
['data120', 'data220', 'data221', 'data320']
Station 3
['data130', 'data131', 'data230', 'data231', 'data232', 'data330']
Station 4
['data140', 'data141', 'data240', 'data241', 'data242', 'data340']

This code can cope with station data being missing for a particular station in one or two of the files, but not if it's missing from all three files, since it breaks out of the main processing loop when the station_lines list is empty; that shouldn't be a problem for your data, though.

For more information on generators and the generator send method, please see 6.2.9. Yield expressions in the docs.
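To make the push-back behaviour concrete, here is a minimal sketch (an illustration only, not part of the solution above) that drives file_buff by hand, using a small in-memory iterator in place of a real file:

# file_buff accepts any iterator of lines, so a list iterator works here
lines = iter(['1,a', '1,b', '2,c'])
buf = file_buff(lines)

first = next(buf)         # '1,a'
echoed = buf.send(first)  # push '1,a' back; send() echoes the pushed line
print(next(buf))          # '1,a' again, re-read from the buffer
print(next(buf))          # '1,b', back to consuming the underlying iterator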

This code was developed using Python 3, but it will also run on Python 2.6+ (you just need to include from __future__ import print_function at the top of the script).

If station IDs can be missing from all 3 files, we can easily handle that too. Just use a simple range loop in place of the infinite str_count generator:



from random import seed, randrange

seed(123)

station_hi = 7
def create_data():
    ''' Fill 3 files with fake station data '''
    fbase = 'datafile_'
    for fnum in range(1, 4):
        with open(fbase + str(fnum), 'w') as f:
            for snum in range(station_hi):
                for i in range(randrange(0, 2)):
                    s = '{1} data{0}{1}{2}'.format(fnum, snum, i)
                    print(s)
                    f.write(s + '\n')
        print()

create_data()

# A file buffer that you can push lines back to
def file_buff(fh):
    prev = None
    while True:
        while prev:
            yield prev
            prev = yield prev
        prev = yield next(fh)

station_start = 0
station_stop = station_hi

# Extract station data from all 3 files
with open('datafile_1') as f1, open('datafile_2') as f2, open('datafile_3') as f3:
    fb1, fb2, fb3 = file_buff(f1), file_buff(f2), file_buff(f3)

    for i in range(station_start, station_stop):
        snum_str = str(i)
        station_lines = []
        for fb in (fb1, fb2, fb3):
            for line in fb:
                #Extract station number string & station data
                sid, sdata = line.split()
                if sid != snum_str:
                    # This line contains data for the next station,
                    # so push it back to the buffer
                    rc = fb.send(line)
                    # and go to the next file
                    break
                # Otherwise, append this data
                station_lines.append(sdata)

        if not station_lines:
            continue
        print('Station', snum_str)
        print(station_lines)

Answer 1 (score: 0)

I would suggest using pandas.read_csv. You can specify the rows to skip with skiprows, and use nrows to load a reasonable number of rows at a time, depending on your file size. Here is a link to the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
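For example, a rough sketch of that idea (the filename, chunk size, and the assumption that the files have no header row are placeholders):

import pandas as pd

CHUNK = 100000  # rows per call; tune this to your memory budget

skip = 0
while True:
    df = pd.read_csv('file1.csv', header=None, skiprows=skip, nrows=CHUNK)
    # <group df by its first column (the station number) and process>
    if len(df) < CHUNK:  # final partial chunk reached, so stop
        break            # (a file exactly divisible by CHUNK needs extra care)
    skip += CHUNK

Note that skiprows still scans past all the skipped lines on every call, so the chunksize argument of read_csv, which returns an iterator of DataFrames over a single open file, is usually a faster way to get the same effect.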

Answer 2 (score: 0)

I posted the code below before @PM_2Ring posted his solution. I would like to keep both solutions alive:

Solution #1, which relies on the Pandas library (suggested by @DerFaizio):

This solution finished 5,450 stations in 120 minutes (about 45 stations/minute):

import pandas as pd

# full_paths and data_out are set up the same way as in the generator-based
# solution below (paths to the 3 CSV files and a csv writer for the results)
skips = [1, 1, 1]  # per-file skiprows counters; start at 1 to skip the header row forever
for station_number in range(100798):
    storage = {}
    tmax = pd.read_csv(full_paths[0], skiprows=skips[0], header=None, nrows=126000, usecols=[0, 1, 3])
    tmin = pd.read_csv(full_paths[1], skiprows=skips[1], header=None, nrows=126000, usecols=[0, 1, 3])
    tavg = pd.read_csv(full_paths[2], skiprows=skips[2], header=None, nrows=126000, usecols=[0, 1, 3])

    # tmax is at position 0
    for idx, station in enumerate(tmax[0]):
        if station == station_number:
            date_val = tmax[1][idx]
            t_val = float(tmax[3][idx])/10.
            storage[date_val] = [t_val, None, None]
            skips[0] += 1
        else:
            break
    # tmin is at position 1
    for idx, station in enumerate(tmin[0]):
        # station, date_val, _, val = lne.split(",")
        if station == station_number:
            date_val = tmin[1][idx]
            t_val = float(tmin[3][idx]) / 10.
            if date_val in storage:
                storage[date_val][1] = t_val
            else:
                storage[date_val] = [None, t_val, None]
            skips[1] += 1
        else:
            break
    # tavg is at position 2
    for idx, station in enumerate(tavg[0]):
        ...
        # similar to Tmin
        ...
        pass

    station_info = []
    for key in storage.keys():
        # do some analysis
        # Fill the list station_info 
        pass
    data_out.writerows(station_info)

Solution #2, the generator-based solution (based on @PM_2Ring's answer):

This solution finished 30,000 stations in 12 minutes (about 2,500 stations/minute):

import csv
import os
import time

files = ['Tmax', 'Tmin', 'Tavg']
headers = ['Nesr_Id', 'r_Year', 'r_Month', 'r_Day', 'Tmax', 'Tmin', 'Tavg']

# A file buffer that you can push lines back to
def file_buff(fh):
    prev = None
    while True:
        while prev:
            yield prev
            prev = yield prev
        prev = yield next(fh)

# An infinite counter that yields numbers converted to strings
def str_count(start=0):
    n = start
    while True:
        yield str(n)
        n += 1

# NULL = -999.99
print "Time started: {}".format(time.strftime('%Y-%m-%d %H:%M:%S'))
with open('Results\\GHCN_Daily\\Important\\Temp_All_out_gen.csv', 'wb+') as out_file:
    data_out = csv.writer(out_file, quoting=csv.QUOTE_NONE, quotechar='', delimiter=',', escapechar='\\',
                          lineterminator='\n')
    data_out.writerow(headers)
    full_paths = [os.path.join(source, '{}.csv'.format(file_name)) for file_name in files]
    # Extract station data from all 3 files
    # note: the third file must be full_paths[2] (Tavg), not full_paths[0] again
    with open(full_paths[0]) as f1, open(full_paths[1]) as f2, open(full_paths[2]) as f3:
        fb1, fb2, fb3 = file_buff(f1), file_buff(f2), file_buff(f3)

        for snum_str in str_count():
            # station_lines = []
            storage ={}
            count = [0, 0, 0]
            for file_id, fb in enumerate((fb1, fb2, fb3)):
                for line in fb:
                    if not isinstance(get__proper_data_type(line.split(",")[0]), str):
                        # Extract station number string & station data
                        sid, date_val, _dummy, sdata = line.split(",")
                        if sid != snum_str:
                            # This line contains data for the next station,
                            # so push it back to the buffer
                            rc = fb.send(line)
                            # and go to the next file
                            break
                        # Otherwise, append this data
                        sdata = float(sdata) / 10.
                        count[file_id] += 1
                        if date_val in storage:
                            storage[date_val][file_id] = sdata
                        else:
                            # create a fresh record, then fill the slot for this file
                            storage[date_val] = [None, None, None]
                            storage[date_val][file_id] = sdata
                        # station_lines.append(sdata)

            # # Process all the data lines for this station
            # if not station_lines:
            #     # There's no more data to process
            #     break
            print "St# {:6d}/100797. Time: {}. Tx({}), Tn({}), Ta({}) ".\
                format(int(snum_str), time.strftime('%H:%M:%S'), count[0], count[1], count[2])
            # print(station_lines)

            station_info = []
            for key in storage.keys():
                # key_val = storage[key]
                tx, tn, ta = storage[key]
                if ta is None:
                    if tx != None and tn != None:
                        ta = round((tx + tn) / 2., 1)
                if tx is None:
                    if tn != None and ta != None:
                        tx = round(2. * ta - tn, 1)
                if tn is None:
                    if tx != None and ta != None:
                        tn = round(2. * ta - tx, 1)
                # print key,
                py_date = from_excel_ordinal(int(key))
                # print py_date
                station_info.append([snum_str, py_date.year, py_date.month, py_date.day, tx, tn, ta])

            data_out.writerows(station_info)
            del station_info

Thanks, everybody.

Answer 3 (score: -1)

Using the built-in csv module, you can do something like the following:

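A minimal sketch of that idea (the filename, the value of n, and the per-row processing step are placeholders, not part of the original answer):

import csv
from itertools import islice

n = 100  # number of rows to skip (placeholder value)

with open('file1.csv') as f:
    reader = csv.reader(f)
    for row in islice(reader, n, None):  # start yielding rows at index n
        station = row[0]  # the first cell holds the station number
        # process the row here, e.g.:
        print(station)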

where n is the number of rows you want to skip.