Read csv into an array, perform linear regression on the array, and write to csv in Python depending on the gradient

Time: 2016-04-14 02:12:47

Tags: python arrays csv numpy regression

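I have to solve a problem that is far beyond my current Python programming skills. I have trouble combining the different modules (csv reader, numpy, etc.) into a single script. My data contains a large number of weather variables for many days, at one-minute resolution. My aim is to determine the trend of the wind speed between 9 am and 12 pm for every day in the list. If the gradient of the wind speed is positive, I want to write the date on which this occurred to a new csv file, along with the wind direction.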

The data runs to thousands of lines and looks like this:

hd,Station Number,Year Month Day Hours Minutes in YYYY,MM,DD,HH24,MI format in Local time,Year Month Day Hours Minutes in YYYY,MM,DD,HH24,MI format in Local standard time,Year Month Day Hours Minutes in YYYY,MM,DD,HH24,MI format in Universal coordinated time,Precipitation since last (AWS) observation in mm,Quality of precipitation since last (AWS) observation value,Air Temperature in degrees Celsius,Quality of air temperature,Air temperature (1-minute maximum) in degrees Celsius,Quality of air temperature (1-minute maximum),Air temperature (1-minute minimum) in degrees Celsius,Quality of air temperature (1-minute minimum),Wet bulb temperature in degrees Celsius,Quality of Wet bulb temperature,Wet bulb temperature (1 minute maximum) in degrees Celsius,Quality of wet bulb temperature (1 minute maximum),Wet bulb temperature (1 minute minimum) in degrees Celsius,Quality of wet bulb temperature (1 minute minimum),Dew point temperature in degrees Celsius,Quality of dew point temperature,Dew point temperature (1-minute maximum) in degrees Celsius,Quality of Dew point Temperature (1-minute maximum),Dew point temperature (1 minute minimum) in degrees Celsius,Quality of Dew point Temperature (1 minute minimum),Relative humidity in percentage %,Quality of relative humidity,Relative humidity (1 minute maximum) in percentage %,Quality of relative humidity (1 minute maximum),Relative humidity (1 minute minimum) in percentage %,Quality of Relative humidity (1 minute minimum),Wind (1 minute) speed in km/h,Wind (1 minute) speed quality,Minimum wind speed (over 1 minute) in km/h,Minimum wind speed (over 1 minute) quality,Wind (1 minute) direction in degrees true,Wind (1 minute) direction quality,Standard deviation of wind (1 minute),Standard deviation of wind (1 minute) direction quality,Maximum wind gust (over 1 minute) in km/h,Maximum wind gust (over 1 minute) quality,Visibility (automatic - one minute data) in km,Quality of visibility (automatic - one minute data),Mean sea level pressure in hPa,Quality of mean sea level pressure,Station level pressure in hPa,Quality of station level pressure,QNH pressure in hPa,Quality of QNH pressure,#
hd, 40842,2000,03,20,10,50,2000,03,20,10,50,2000,03,20,00,50,      ,N, 25.7,N, 25.7,N, 25.6,N, 21.5,N, 21.5,N, 21.4,N, 19.2,N, 19.2,N, 19.0,N, 67,N, 68,N, 66,N, 13,N,  9,N,100,N,  4,N, 15,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,51,2000,03,20,10,51,2000,03,20,00,51,   0.0,N, 25.6,N, 25.8,N, 25.6,N, 21.5,N, 21.6,N, 21.5,N, 19.2,N, 19.4,N, 19.2,N, 68,N, 68,N, 66,N, 11,N,  9,N,107,N, 11,N, 13,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,52,2000,03,20,10,52,2000,03,20,00,52,   0.0,N, 25.8,N, 25.8,N, 25.6,N, 21.7,N, 21.7,N, 21.5,N, 19.5,N, 19.5,N, 19.2,N, 68,N, 69,N, 66,N, 11,N,  9,N, 83,N, 13,N, 13,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,53,2000,03,20,10,53,2000,03,20,00,53,   0.0,N, 25.8,N, 25.9,N, 25.8,N, 21.6,N, 21.8,N, 21.6,N, 19.3,N, 19.6,N, 19.3,N, 67,N, 68,N, 66,N,  9,N,  8,N, 87,N, 14,N, 11,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,54,2000,03,20,10,54,2000,03,20,00,54,   0.0,N, 25.8,N, 25.8,N, 25.8,N, 21.6,N, 21.6,N, 21.6,N, 19.3,N, 19.3,N, 19.2,N, 67,N, 67,N, 67,N,  8,N,  4,N, 98,N, 23,N,  9,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,55,2000,03,20,10,55,2000,03,20,00,55,   0.0,N, 25.7,N, 25.8,N, 25.7,N, 21.5,N, 21.6,N, 21.5,N, 19.2,N, 19.3,N, 19.2,N, 67,N, 68,N, 66,N,  8,N,  4,N, 68,N, 15,N,  9,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,56,2000,03,20,10,56,2000,03,20,00,56,   0.0,N, 25.9,N, 25.9,N, 25.7,N, 21.7,N, 21.7,N, 21.5,N, 19.4,N, 19.4,N, 19.2,N, 67,N, 68,N, 66,N,  8,N,  5,N, 69,N, 16,N,  9,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,57,2000,03,20,10,57,2000,03,20,00,57,   0.0,N, 26.0,N, 26.0,N, 25.9,N, 21.8,N, 21.8,N, 21.7,N, 19.5,N, 19.5,N, 19.4,N, 67,N, 68,N, 66,N,  9,N,  5,N, 72,N, 10,N, 11,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#
hd, 40842,2000,03,20,10,58,2000,03,20,10,58,2000,03,20,00,58,   0.0,N, 26.0,N, 26.1,N, 26.0,N, 21.7,N, 21.8,N, 21.7,N, 19.4,N, 19.5,N, 19.3,N, 66,N, 67,N, 66,N,  8,N,  5,N, 69,N, 13,N, 11,N,     ,N,1018.6,N,1017.5,N,1018.6,N,#

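The finished file, containing only the dates on which the wind speed increased between 9 am and 12 pm, would hopefully look like this: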

date,wind direction,gradient_of_wind_speed,
2000/3/25,108,0.7,
2000/4/17,67,0.4,
...

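The exact value of the gradient is not important, only whether it is positive, so a second array (1,2,3,4,5 ...) could be constructed to serve as the second dimension for the linear regression. The challenge is that many days have missing data, so although the array should have length 180 (i.e. 180 minutes between 9 am and 12 pm), in practice its length will vary.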
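A minimal sketch of that idea, assuming missing minutes show up as NaN in the wind-speed array (the function name wind_speed_slope is made up for illustration). np.polyfit does not need a fixed length of 180; it only needs x and y to have the same length, so missing points can simply be dropped before fitting:

```python
import numpy as np


def wind_speed_slope(speeds):
    """Least-squares slope of `speeds` against the index 1, 2, 3, ...

    NaN entries (missing minutes) are dropped before fitting.
    Returns None when fewer than two valid points remain.
    """
    y = np.asarray(speeds, dtype=float)
    x = np.arange(1, len(y) + 1, dtype=float)  # the 1,2,3,... array from the question
    valid = ~np.isnan(y)
    if valid.sum() < 2:
        return None  # a line cannot be fitted through fewer than two points
    slope, _intercept = np.polyfit(x[valid], y[valid], 1)
    return slope


# Wind speeds from the nine sample rows above (10:50 to 10:58): the trend is downward.
print(wind_speed_slope([13, 11, 11, 9, 8, 8, 8, 9, 8]) > 0)
# A series with gaps still works, because the gaps are removed from both x and y.
print(wind_speed_slope([5, np.nan, 7, 8, np.nan, 10]) > 0)
```

If whole rows are absent from the file rather than present with a blank value, a plain 1,2,3,... index squeezes the time axis, so using the actual minute of each observation as x is probably the safer choice.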

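Is this more easily solved with multiple scripts (bearing in mind that I have to do this for more than 100 files), or is there a simple way to solve it in a single script?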

Code attempted:

import glob
import pandas as pd
import numpy as np

for file in glob.glob('X:/brisbaneweatherdata/*.txt'):
    df = pd.read_csv(file)
    for date, group in df.groupby(['Year Month Day Hours Minutes in YYYY','MM','DD']):
        morning_data = group[group.HH24.between('09','12')]
        # calculate your linear regression here
        gradient, intercept = np.polyfit(morning_data.HH24,morning_data['Wind (1 minute) speed in km/h'], 1)
        wind_direction= np.average(morning_data.HH24,morning_data['Wind (1 minute) direction in degrees true'])
        if gradient > 0 :
            print(date + "," + gradient + "," + wind_direction)

Error message received:

runfile('X:/python/linearregression.py', wdir='X:/python')
X:/python/linearregression.py:1: DtypeWarning: Columns (17,25,27,29,31,33,35,37,55,57,59) have mixed types. Specify dtype option on import or set low_memory=False.
  import glob
Traceback (most recent call last):

  File "<ipython-input-26-ace8af14da2c>", line 1, in <module>
    runfile('X:/python/linearregression.py', wdir='X:/python')

  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)

  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "X:/python/linearregression.py", line 8, in <module>
    morning_data = group[group.HH24.between('09','12')]

  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\series.py", line 2486, in between
    lmask = self >= left

  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\ops.py", line 761, in wrapper
    res = na_op(values, other)

  File "C:\Users\kirkj\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\ops.py", line 716, in na_op
    raise TypeError("invalid type comparison")

TypeError: invalid type comparison

1 answer:

Answer 0 (score: 2)

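I think you should be able to do this in one fairly simple script, using glob to iterate over your files and pandas to read your data. Here is a basic outline of how I would structure it: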

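import glob

import numpy as np
import pandas as pd

# 'year', 'month', 'day', 'wind speed' and 'wind direction' are placeholder column names
for file in glob.glob('data/*'):
    df = pd.read_csv(file)
    for date, group in df.groupby(['year', 'month', 'day']):
        # compare the hour as an integer; hours 9 to 11 cover 9 am up to 12 pm
        morning_data = group[group.HH24.between(9, 11)]
        if morning_data.HH24.nunique() < 2:
            continue  # not enough distinct points to fit a line
        # calculate your linear regression here
        gradient, intercept = np.polyfit(morning_data.HH24, morning_data['wind speed'], 1)
        wind_direction = morning_data['wind direction'].mean()
        if gradient > 0:
            print("{},{},{}".format(date, wind_direction, gradient))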
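A fuller sketch of the same outline, under several assumptions that are not confirmed by the question: every file has the 62-field layout of the sample rows, so the local-time date fields sit at positions 2 to 6, wind speed at position 43 and wind direction at position 47; blank fields mean missing data; and the function names positive_gradient_days and write_results are made up for illustration. Selecting columns by position sidesteps the repeated MM, DD and HH24 header names, and converting everything to numbers first avoids the "invalid type comparison" raised when an integer hour column is compared with the strings '09' and '12'.

```python
import csv
import glob

import numpy as np
import pandas as pd

# Assumed field positions, counted from the sample rows in the question:
# local-time year, month, day, hour, minute, wind speed, wind direction.
USECOLS = [2, 3, 4, 5, 6, 43, 47]
NAMES = ["year", "month", "day", "hour", "minute", "speed", "direction"]


def positive_gradient_days(path):
    """Return (date, mean direction, slope) for days whose 9-12 wind speed rises."""
    df = pd.read_csv(path, usecols=USECOLS, skipinitialspace=True, low_memory=False)
    df.columns = NAMES  # usecols keeps the file's column order, which matches NAMES
    df = df.apply(pd.to_numeric, errors="coerce")  # blanks and stray text become NaN
    df = df.dropna(subset=["year", "month", "day", "hour", "minute"])
    # Hours 9, 10 and 11 cover 09:00 up to 11:59; rows without a speed are dropped.
    morning = df[df["hour"].between(9, 11)].dropna(subset=["speed"])

    rows = []
    for (year, month, day), group in morning.groupby(["year", "month", "day"]):
        # Use minutes since 09:00 as x so that gaps do not distort the time axis.
        minutes = (group["hour"] - 9) * 60 + group["minute"]
        if minutes.nunique() < 2:
            continue  # cannot fit a line through a single time point
        slope = np.polyfit(minutes, group["speed"], 1)[0]
        if slope > 0:
            date = "%d/%d/%d" % (year, month, day)
            # Plain arithmetic mean of the direction, as in the question's attempt.
            direction = float(round(group["direction"].mean(), 1))
            rows.append((date, direction, float(round(slope, 4))))
    return rows


def write_results(pattern, out_path):
    """Process every file matching `pattern` and write one combined csv."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date", "wind direction", "gradient_of_wind_speed"])
        for path in sorted(glob.glob(pattern)):
            writer.writerows(positive_gradient_days(path))
```

With the paths from the question, this could be called as write_results('X:/brisbaneweatherdata/*.txt', 'positive_days.csv'), where the output file name is only an example. Note that an arithmetic mean of compass directions can mislead when the wind swings across north (for example 350 and 10 degrees average to 180), so a circular mean may be preferable.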