Handling writing data to text files

Date: 2018-02-20 09:31:05

Tags: python text-files large-files netcdf

I have a fairly general question. It concerns writing a large number of large text files. The content of each text file is based on a dataset and differs from file to file. The basic question is how to do this most efficiently.

More specifically, I want to do spatially explicit model runs (a crop model). The model requires its input files in txt format, so if I want to run it for a large number of raster cells, I need a separate text file for each cell (thousands of them). The efficiency problem arises when writing the weather inputs based on climate projections: they come at a daily time step for up to 100 years, i.e. about 36,500 rows (with 8 variables each) have to be extracted from the dataset and written to every text file.

My first attempt was a for loop that iterates over each location (i.e. each text file) and, for each text file, loops over every daily climate time step to build the whole climate file as a single string, which is then written to the text file (I also tested writing to the file at every time step, but the timing was similar).

This approach takes about 1-2 minutes per file on my (somewhat old) machine. For a 70x80-cell raster that works out to roughly 7 days. Of course I could reduce the number of locations and pick fewer time steps, but still, I would like to know whether there is a more efficient way to do this.

From my research so far, I believe the for loop that builds/writes every single line to the file is the bottleneck, and I am wondering whether pulling the data into an array or dataframe and then saving it to csv would be faster. Or what do you think is the most suitable approach for this kind of operation?
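
To make that concrete, this is roughly the idea I have in mind (an untested sketch; I am assuming the variables from my code below fit into memory and have a (time, lat, lon) shape):

import numpy as np

# pull each variable into memory once instead of slicing the netCDF file per time step
tasmin_all = tasmin45.variables["tasmin"][:]
tasmax_all = tasmax45.variables["tasmax"][:]
pr_all     = precip45.variables["pr"][:]
et_all     = evsps45.variables["evspsblpot"][:]

days   = np.array([d.day   for d in datetime_model])
months = np.array([d.month for d in datetime_model])
years  = np.array([d.year  for d in datetime_model])

for x in lat:
    for y in lon:
        # one (time x 7) table per cell, written in a single call (header line omitted for brevity)
        table = np.column_stack((days, months, years,
                                 tasmin_all[:, x, y], tasmax_all[:, x, y],
                                 pr_all[:, x, y], et_all[:, x, y]))
        out_file = os.path.join(file_path + str(x) + "_lon" + str(y), "Weather.txt")
        np.savetxt(out_file, table, delimiter="\t",
                   fmt=["%d", "%d", "%d", "%.4f", "%.4f", "%.4f", "%.4f"])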

Thanks in advance!

Best regards, Anton.

Here is the code:

Please let me know if I should provide additional code/information etc.; I started programming and learning this only a month ago, so apologies if things are a bit messy.

import os
import glob
import netCDF4
from netCDF4 import Dataset
from cdo import Cdo   # Python bindings for CDO (Climate Data Operators)

cdo = Cdo()

weather_dir = os.path.join(os.getcwd(), "data", "raw", "weather")

precip45l = glob.glob(weather_dir+"/pr*45*d.nc")
tasmax45l = glob.glob(weather_dir+"/tasmax*45*")
tasmin45l = glob.glob(weather_dir+"/tasmin*45*")
evsps45l = glob.glob(weather_dir+"/evsps*45*")

cdo.mergetime(input=precip45l, output= weather_dir+"/precip45.nc")
cdo.mulc("86400", input=weather_dir+"/precip45.nc"
         , output= weather_dir+"/precip45mm.nc" )
precip45 = Dataset(weather_dir+"/precip45mm.nc")

cdo.mergetime(input= tasmax45l, output= weather_dir+"/tasmax45.nc")   
cdo.subc("273.15", input=weather_dir+"/tasmax45.nc"
         , output= weather_dir+"/tasmax45C.nc" )
tasmax45 = Dataset(weather_dir+"/tasmax45C.nc")

cdo.mergetime(input= tasmin45l, output= weather_dir+"/tasmin45.nc")   
cdo.subc("273.15", input=weather_dir+"/tasmin45.nc"
         , output= weather_dir+"/tasmin45C.nc" )
tasmin45 = Dataset(weather_dir+"/tasmin45C.nc")

cdo.mergetime(input= evsps45l, output= weather_dir+"/evsps45.nc")   
cdo.mulc("86400", input=weather_dir+"/evsps45.nc"
         , output= weather_dir+"/evsps45mm.nc" )
evsps45 = Dataset(weather_dir+"/evsps45mm.nc")

datetime_model = netCDF4.num2date(precip45.variables["time"][:]
                                 , "days since 1949-12-1 00:00:00"
                                 )



def create_weather():
    time_length = range(len(datetime_model))
    file_path = os.path.join(os.getcwd(), "data", "input", "lat")
    for x in lat:          # lat and lon hold the grid indices (defined elsewhere)
        for y in lon:
            fh = open(os.path.join(file_path + str(x) + "_lon" + str(y), "Weather.txt"), "w")
            fh.write("%% ---------- Weather input time-series for AquaCropOS ---------- %%\n"
                     "%%Day\tMonth\tYear\tMinTemp\tMaxTemp\tPrecipitation\tReferenceET%%")
            for i in time_length:
                fh.write(
                        "\n" + str(datetime_model[i].day)
                        + "\t" + str(datetime_model[i].month)
                        + "\t" + str(datetime_model[i].year)
                        + "\t" + str(tasmin45.variables["tasmin"][i][x][y])
                        + "\t" + str(tasmax45.variables["tasmax"][i][x][y])
                        + "\t" + str(precip45.variables["pr"][i][x][y])
                        + "\t" + str(evsps45.variables["evspsblpot"][i][x][y])
                        )
            fh.close()

create_weather()      

I profiled the code with cProfile:

         21695294 function calls in 137.753 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  100.772  100.772  137.752  137.752 <ipython-input-25-a234aeb2049c>:1(create_weather)
        1    0.000    0.000  137.753  137.753 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 _bootlocale.py:23(getpreferredencoding)
   876576    0.558    0.000    5.488    0.000 _methods.py:37(_any)
   584384    0.292    0.000    3.154    0.000 _methods.py:40(_all)
        2    0.000    0.000    0.000    0.000 codecs.py:185(__init__)
  2629728    2.130    0.000    3.675    0.000 function_base.py:213(iterable)
  1460960    0.562    0.000    3.935    0.000 numeric.py:424(asarray)
        3    0.000    0.000    0.000    0.000 posixpath.py:39(_get_sep)
        3    0.000    0.000    0.000    0.000 posixpath.py:73(join)
   584384    3.395    0.000    7.891    0.000 utils.py:23(_safecast)
   876576    0.565    0.000    0.565    0.000 utils.py:40(_find_dim)
   292192    1.744    0.000    2.227    0.000 utils.py:423(_out_array_shape)
   292192    9.756    0.000   20.609    0.000 utils.py:88(_StartCountStride)
        2    0.000    0.000    0.000    0.000 {built-in method _locale.nl_langinfo}
        1    0.000    0.000  137.753  137.753 {built-in method builtins.exec}
        3    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
  2629728    1.546    0.000    1.546    0.000 {built-in method builtins.iter}
  1753153    0.263    0.000    0.263    0.000 {built-in method builtins.len}
   292192    0.214    0.000    0.214    0.000 {built-in method builtins.max}
        2    0.001    0.000    0.001    0.000 {built-in method io.open}
  1460960    3.373    0.000    3.373    0.000 {built-in method numpy.core.multiarray.array}
  1168768    2.158    0.000    2.158    0.000 {built-in method numpy.core.multiarray.empty}
        3    0.000    0.000    0.000    0.000 {built-in method posix.fspath}
        1    0.000    0.000    0.000    0.000 {built-in method posix.getcwd}
   584384    1.342    0.000    4.496    0.000 {method 'all' of 'numpy.generic' objects}
  3214112    0.369    0.000    0.369    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        5    0.000    0.000    0.000    0.000 {method 'endswith' of 'str' objects}
   584384    0.347    0.000    0.347    0.000 {method 'indices' of 'slice' objects}
   876576    0.375    0.000    0.375    0.000 {method 'ravel' of 'numpy.ndarray' objects}
  1460960    7.791    0.000    7.791    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        5    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
    73050    0.199    0.000    0.199    0.000 {method 'write' of '_io.TextIOWrapper' objects}

2 answers:

Answer 0 (score: 0):

It can help to test how long each step takes, by doing something like here.
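
For instance, something along these lines (a rough sketch; step_one is just a placeholder for whichever part of your script you want to measure):

import time

start = time.perf_counter()
step_one()                                   # e.g. one of the cdo calls, or one call to create_weather()
print("step_one took", time.perf_counter() - start, "seconds")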

It looks like your code is trying to hold a lot of information in memory.

You could test that by running this on each thing you read in:

import sys

sys.getsizeof(obj)  # size of obj in bytes
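
For example, applied to one of the arrays from your code (just an illustration; for a numpy array, arr.nbytes also reports the size of the raw data directly):

pr_array = precip45.variables["pr"][:]       # loads the whole precipitation variable into memory
print(sys.getsizeof(pr_array), pr_array.nbytes)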

Try that first and rule these things out.

If it is a memory problem, try reading the file in chunks rather than all at once.

import itertools

with open(file, "r") as text:
    for line in itertools.islice(text, 0, 100):  # only the first 100 lines
        ...  # parse line

I appreciate this is not a complete answer, but maybe it will get you started.

Answer 1 (score: 0):

First, you need to find out whether it is computing the data or writing it that takes the time. The easiest way is to separate the program into logical functions and time them with timeit.

Calculating the data

There is a lot of repetition in computing the model inputs, so we can easily abstract that away:

import csv
import timeit
import netCDF4
from itertools import product
from netCDF4 import Dataset
from pathlib import Path


weather_dir = Path('.') / 'data' / 'raw' / 'weather'

measurements = {
    'precipitation': {
        'glob_pattern': "pr*45*d.nc",
        'operation': 'mulc',          # multiply by 86400, as in the question
        'value': '86400',
        'input_filename': 'precip45.nc',
        'output_filename': 'precip45mm.nc',
    },
    'tasmax': {
        'glob_pattern': "tasmax*45*",
        'operation': 'subc',          # subtract 273.15 (K -> degrees C)
        'value': '273.15',
        'input_filename': 'tasmax45.nc',
        'output_filename': 'tasmax45C.nc',
    },
    'tasmin': {
        'glob_pattern': "tasmin*45*",
        'operation': 'subc',
        'value': '273.15',
        'input_filename': 'tasmin45.nc',
        'output_filename': 'tasmin45C.nc',
    },
    'evsps45': {
        'glob_pattern': "evsps*45*",
        'operation': 'mulc',
        'value': '86400',
        'input_filename': 'evsps45.nc',
        'output_filename': 'evsps45mm.nc',
    },
}

def get_measurement(cdo, weather_dir, settings):
    input_files = [str(p) for p in weather_dir.glob(settings['glob_pattern'])]
    temp_file = str(weather_dir / settings['input_filename'])
    out_file = str(weather_dir / settings['output_filename'])
    cdo.mergetime(input=input_files, output=temp_file)
    # apply the unit conversion defined in the settings (mulc or subc)
    getattr(cdo, settings['operation'])(
        settings['value'],
        input=temp_file,
        output=out_file,
        )
    return Dataset(out_file)

That part can then be timed very easily, like this:

times = {}
data = {}
for key, value in measurements.items():
    times[key] = timeit.timeit(
        'data[key] = get_measurement(cdo, weather_dir, value)',
        number=1,
        globals=globals(),
        )

times['datetime_model'] = timeit.timeit(
    '''data['datetime_model'] = netCDF4.num2date(
            data['precipitation'].variables["time"][:],
            "days since 1949-12-1 00:00:00",
            )''',
    number=1,
    globals=globals(),
    )

With the computation abstracted like this, you can also check whether a result has already been computed; if it has, there may be no reason to compute it again:

def get_measurement_with_cache(cdo, weather_dir, settings):
    input_files = [str(p) for p in weather_dir.glob(settings['glob_pattern'])]
    temp_file = str(weather_dir / settings['input_filename'])
    out_file = weather_dir / settings['output_filename']
    if not out_file.exists():
        # You might want to include some of the parameters of the model in the
        # filename to distinguish runs with different parameters
        cdo.mergetime(input=input_files, output=temp_file)
        getattr(cdo, settings['operation'])(
            settings['value'],
            input=temp_file,
            output=str(out_file),
            )
    return Dataset(str(out_file))
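
It drops into the same timing loop as before, for example:

for key, value in measurements.items():
    data[key] = get_measurement_with_cache(cdo, weather_dir, value)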

Writing the dataset

This can be done a bit more easily using the csv module:
output_dir = Path('.') / "data" / "input"

def write_output(data, output_dir):
    datetime_model = data['datetime_model']
    time_length = range(len(datetime_model))

    for x, y in product(lat, lon):  # lat and lon as in the question
        output_file = output_dir / f'lat{x}_lon{y}' / "Weather.txt"  # f-strings are a python 3.6 feature
        with output_file.open('w', newline='') as fh:
            fh.write("%% ---------- Weather input time-series for AquaCropOS ---------- %%\n"
                     "%%Day\tMonth\tYear\tMinTemp\tMaxTemp\tPrecipitation\tReferenceET%%\n")
            writer = csv.writer(fh, delimiter='\t', lineterminator='\n')
            for i in time_length:
                row = (
                    datetime_model[i].day,
                    datetime_model[i].month,
                    datetime_model[i].year,
                    data['tasmin'].variables["tasmin"][i][x][y],
                    data['tasmax'].variables["tasmax"][i][x][y],
                    data['precipitation'].variables["pr"][i][x][y],
                    data['evsps45'].variables["evspsblpot"][i][x][y],
                    )
                writer.writerow(row)

and it can be timed like this:

times['writing'] = timeit.timeit(
    'write_output(data, output_dir)',
    number=1,
    globals=globals(),
    )