Plotting data availability over time for every station in a single figure

Time: 2019-04-19 02:00:36

Tags: python r pandas matplotlib

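As the title suggests, I want to plot the data availability over time for each station. The plot can be thought of as a map or scatter plot with station number and time as the coordinates. It should draw a vertical line wherever data is present (i.e. a float/integer) and leave a blank wherever data is missing (i.e. NaN), at the time resolution of the data.

It should look similar to the plot at the end of the post, which is output from the R package "climatol" (homogen function).

I would like to know whether there is a similar way of plotting in Python. I would rather not use the R package, because it does more than just plotting and takes a lot of time for data from thousands of stations.

Some sample data (daily time series) for each station would look like: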

import numpy as np
import pandas as pd

# pd.np was removed from recent pandas versions; use numpy directly
station1 = pd.DataFrame(np.random.rand(100, 1)).set_index(pd.date_range(start = '2000/01/01', periods = 100))
station2 = pd.DataFrame(np.random.rand(200, 1)).set_index(pd.date_range(start = '2000/03/01', periods = 200))
station3 = pd.DataFrame(np.random.rand(300, 1)).set_index(pd.date_range(start = '2000/06/01', periods = 300))
station4 = pd.DataFrame(np.random.rand(50, 1)).set_index(pd.date_range(start = '2000/09/01', periods = 50))
station5 = pd.DataFrame(np.random.rand(340, 1)).set_index(pd.date_range(start = '2000/01/01', periods = 340))

Real sample data: https://drive.google.com/drive/folders/15PwpWIh13tyOyzFUTiE9LgrxUMm-9gh6?usp=sharing Code to open two stations:

import pandas as pd
import numpy as np


df1 = pd.read_csv('wgenf - 2019-04-17T012724.318.genform1_proc',skiprows = 8,delimiter = '  ')
df1.drop(df1.tail(6).index,inplace=True)
df1 = df1.iloc[:,[1,3]]
df1.iloc[:,1] = df1.iloc[:,1].replace('-',np.nan)  # assign back; inplace on a selection may not update df1
df1 = df1.dropna()
df1['Date(NZST)'] = pd.to_datetime(df1.iloc[:,0],format = "%Y %m %d")
df1 = df1.set_index('Date(NZST)')

df2 = pd.read_csv('wgenf - 2019-04-17T012830.116.genform1_proc',skiprows = 8,delimiter = '  ')
df2.drop(df2.tail(6).index,inplace=True)
df2 = df2.iloc[:,[1,3]]
df2.iloc[:,1] = df2.iloc[:,1].replace('-',np.nan)  # assign back; inplace on a selection may not update df2
df2 = df2.dropna()
df2['Date(NZST)'] = pd.to_datetime(df2.iloc[:,0],format = "%Y %m %d")
df2 = df2.set_index('Date(NZST)')

[Image: example data-availability plot referred to above]

Extending Asmus's code (answer below) to multiple stations:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import glob
start = '1900/01/01'
end = '2018/12/31'
counter = 0
filenames = glob.glob('data/temperature/*.genform1_proc')
for filename in filenames:
    with open(filename, newline='') as f:

        ### read the csv file with pandas, using the correct tab delimiter 
        df1 = pd.read_csv(f,skiprows = 8,delimiter = '\t',)
        df1.drop(df1.tail(8).index,inplace=True)


        ### replace invalid '-' with useable np.nan (not a number)
        df1.replace('-',np.nan,inplace=True)
        df1['Date(NZST)'] = pd.to_datetime(df1['Date(NZST)'],format = "%Y %m %d")
        df1 = df1.set_index('Date(NZST)',drop=False)

        ### To make sure that we have data on all dates:
        #   create a new index, based on the old range, but daily frequency
        idx = pd.date_range(start,end,freq="D")
        df1=df1.reindex(idx, fill_value=np.nan)

        ### Make sure interesting data fields are numeric (i.e. floats)
        df1["Tmax(C)"]=pd.to_numeric(df1["Tmax(C)"])
        ### Create masks for 
        #   valid data: has both date and temperature
        valid_mask= df1['Tmax(C)'].notnull()

        ### decide where to plot the line in y space, 
        ys=[counter for v in df1['Tmax(C)'][valid_mask].values]


        plt.scatter(df1.index[valid_mask].values,ys,s=30,marker="|",color="g")
        plt.show()

        counter +=1
The code above currently produces the plot below.

[Image: plot currently produced by the code above]

1 answer:

Answer 0 (score: 1)

Updated: I have updated this answer based on the comments.

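OK, first of all, your input data is somewhat messy: the delimiter is actually a tab ('\t'), while the first column ends with a comma (,).

Important steps:

  • Clean up first: replace , with \t so that the column headers are read correctly as df.keys(). You might think this does not matter, but try to keep things clean! :-)
  • Keep the index column 'Date(NZST)' as a column, and create a new index (idx) that contains every day in the given range, since a few days are missing from the raw data.
  • Make sure the relevant keys/columns have the correct type, e.g. 'Tmax(C)' should be a float.
  • Finally, you can use .notnull() to select only the valid data, but make sure that both the date and the temperature are present! For convenience, this is stored as valid_mask.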
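
The steps above can be sketched on a tiny inline sample, so they can be tried without the real files. The sample text is made up for illustration: it assumes that the stray comma sits where a tab should be, that '-' marks a missing measurement, and that one day (2000-01-03) is absent altogether. The column names follow the ones used in this answer.

```python
import re
from io import StringIO

import numpy as np
import pandas as pd

# Hypothetical miniature input (not the real station file): the first column
# ends with ',' instead of a tab, '-' marks a missing value, and
# 2000-01-03 has no row at all.
raw = (
    "Station,Date(NZST)\tTmax(C)\n"
    "1234,2000 01 01\t21.5\n"
    "1234,2000 01 02\t-\n"
    "1234,2000 01 04\t19.0\n"
)

# Step 1: replace the stray comma with a tab so every column splits cleanly.
cleaned = StringIO(re.sub(",", "\t", raw))
df = pd.read_csv(cleaned, delimiter="\t")

# Turn the '-' placeholder into NaN and parse the dates.
df = df.replace("-", np.nan)
df["Date(NZST)"] = pd.to_datetime(df["Date(NZST)"], format="%Y %m %d")

# Step 2: keep the date as a column, then reindex onto every day in the range.
df = df.set_index("Date(NZST)", drop=False)
idx = pd.date_range(df.index.min(), df.index.max(), freq="D")
df = df.reindex(idx)

# Step 3: make sure the temperature column is numeric.
df["Tmax(C)"] = pd.to_numeric(df["Tmax(C)"])

# Step 4: a row is valid only if it has both a date and a temperature.
valid_mask = df["Date(NZST)"].notnull() & df["Tmax(C)"].notnull()
print(valid_mask)
```

Here only the first and last day end up valid: 2000-01-02 has a date but no temperature, and 2000-01-03 was created by the reindex and has neither.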

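Finally, I plotted the data, using a green vertical line as the marker for "valid" measurements and a red one for invalid data. See the figure. Now you only need to run this for all stations. Hope this helps!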

sample plot of valid / invalid data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
from io import StringIO
import re
fpath='./wgenf - 2019-04-17T012537.711.genform1_proc'

### cleanup the input file
for_pd = StringIO()
with open(fpath) as fi:
    for line in fi:
        new_line = re.sub(r',', '\t', line.rstrip(),)
        print (new_line, file=for_pd)

for_pd.seek(0)

### read the csv file with pandas, using the correct tab delimiter 
df1 = pd.read_csv(for_pd,skiprows = 8,delimiter = '\t',)
df1.drop(df1.tail(6).index,inplace=True)

### replace invalid '-' with useable np.nan (not a number)
df1.replace('-',np.nan,inplace=True)
df1['Date(NZST)'] = pd.to_datetime(df1['Date(NZST)'],format = "%Y %m %d")
df1 = df1.set_index('Date(NZST)',drop=False)

### To make sure that we have data on all dates:
#   create a new index, based on the old range, but daily frequency
idx = pd.date_range(df1.index.min(), df1.index.max(),freq="D")
df1=df1.reindex(idx, fill_value=np.nan)

### Make sure interesting data fields are numeric (i.e. floats)
df1["Tmax(C)"]=pd.to_numeric(df1["Tmax(C)"])
df1["Station"]=pd.to_numeric(df1["Station"])

### Create masks for 
#   invalid data: has no date, or no temperature
#   valid data: has both date and temperature
valid_mask=( (df1['Date(NZST)'].notnull()) & (df1['Tmax(C)'].notnull()))
na_mask=( (df1['Date(NZST)'].isnull()) | (df1['Tmax(C)'].isnull()))


### Make the plot
fig,ax=plt.subplots()

### decide where to plot the line in y space, here: "1"
ys=[1 for v in df1['Station'][valid_mask].values]
### and plot the data, using a green, vertical line as marker
ax.scatter(df1.index[valid_mask].values,ys,s=10**2,marker="|",color="g")

### potentially: also plot the missing data, using a red, vertical line as marker at y=0.9
yerr=[0.9 for v in df1['Station'][na_mask].values]
ax.scatter(df1.index[na_mask].values,yerr,s=10**2,marker="|",color="r")

### set some limits on the y-axis
ax.set_ylim(0,2)

plt.show()
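
To run this for all stations in one figure, one possible sketch is shown below. It is not part of the original answer: `plot_availability` and `make_station` are hypothetical helpers, and the stations are synthetic stand-ins shaped like the sample data in the question, with roughly 20% of the values blanked out. The idea is to create the axes once, draw one row of markers per station at its own y position, and call `plt.show()` only once at the end. Calling it inside the loop, as in the extended code in the question, appears to be why each station ends up in a separate plot.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_availability(stations, ax=None):
    """Draw one row of '|' markers per station wherever its value is not NaN.

    `stations` maps a station label to a date-indexed pandas Series.
    """
    if ax is None:
        _, ax = plt.subplots()
    for row, (name, series) in enumerate(stations.items()):
        valid = series.dropna()
        # every valid day of this station is drawn at the same y position
        ax.scatter(valid.index.to_numpy(), np.full(len(valid), row),
                   s=30, marker="|", color="g")
    ax.set_yticks(range(len(stations)))
    ax.set_yticklabels(list(stations))
    ax.set_ylim(-1, len(stations))
    ax.set_xlabel("Date")
    ax.set_ylabel("Station")
    return ax


# Synthetic stand-ins for the real station files (assumption for illustration)
rng = np.random.default_rng(0)


def make_station(start, periods):
    values = rng.random(periods)
    values[rng.random(periods) < 0.2] = np.nan  # blank out about 20% of the days
    return pd.Series(values, index=pd.date_range(start=start, periods=periods))


stations = {
    "station1": make_station("2000-01-01", 100),
    "station2": make_station("2000-03-01", 200),
    "station3": make_station("2000-06-01", 300),
    "station4": make_station("2000-09-01", 50),
    "station5": make_station("2000-01-01", 340),
}

ax = plot_availability(stations)
# Call plt.show() once here, after all stations have been drawn.
```

For the real files, each Series would be the cleaned and reindexed 'Tmax(C)' column of one station, built as in the code above.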