Slicing a DataFrame each time a value is encountered

Time: 2018-06-14 04:23:40

Tags: pandas dataframe slice tidy

I have the following weather data time series:

   2016
   Jan  high avg low sum
    1   27  21  14  0
    2   27  20  14  0
    3   26  20  14  0
    4   26  21  15  0
    5   26  21  17  0
    6   26  21  17  0
    7   26  20  14  0
    8   27  20  14  0
    9   25  22  19  0
    10  22  19  17  0
    11  25  19  13  0
    12  24  19  13  0
    13  24  19  13  0
    14  25  19  14  0
    15  26  20  14  0
    16  26  20  14  0
    17  27  20  13  0
    18  26  19  13  0
    19  25  19  14  0
    20  23  20  17  3.05
    21  22  19  16  0
    22  20  17  14  0
    23  21  17  13  0
    24  22  17  11  0
    25  23  17  11  0
    26  22  16  10  0
    27  25  18  11  0
    28  18  17  14  0
    29  25  19  14  0
    30  24  19  13  0
    31  26  21  16  0
    2016 
    Feb high avg    low sum
    1   28  23  18  0

The data runs from 1 January 2016 to 1 January 2018.

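I would like to build a tidy time series dataset. My idea is to start a new DataFrame every time I hit a year line (2016, 2017, 2018), so that I end up with one DataFrame per year and month combination, and then append them all together.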

I am very new to Python and would really appreciate some guidance. Thanks!

Edit: the data is provided as a CSV file.

1 answer:

Answer 0: (score: 0)

This code works for your problem. It is quite long, but I think working through it will give you more practice with Python and pandas.

            import pandas as pd

            # Data collection: the raw text as displayed in the question,
            # read as a single column of strings.
            data = pd.read_csv("data_slice.csv", header=None, dtype=str)
            lines = data[0].values

            # Positions of the month header lines (each one contains "high").
            positions = [i for i, line in enumerate(lines) if "high" in line]

            # Collect one sub-DataFrame per month, then concatenate once at the end.
            frames = []

            for index, pos in enumerate(positions):
                # The year is on the line directly above the month header.
                year = lines[pos - 1].strip()
                # The month is the first whitespace-separated token of the header.
                month = lines[pos].split()[0]

                # Daily rows run up to the next year line, or to the end of the data
                # for the last month.
                if index + 1 < len(positions):
                    rows = lines[pos + 1:positions[index + 1] - 1]
                else:
                    rows = lines[pos + 1:]

                # Split each daily row on whitespace into its key measures.
                col_df = pd.DataFrame([row.split() for row in rows],
                                      columns=["day", "hi", "avr", "low", "sum"])
                col_df.insert(1, "year", year)
                col_df.insert(2, "month", month)
                frames.append(col_df)

            # DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
            final_df = pd.concat(frames, ignore_index=True)
            print(final_df)
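As an alternative to looping over the month blocks, the same idea (year line, then month header, then daily rows) can be sketched with vectorized pandas operations that forward-fill the year and month onto each daily row. This is only a sketch under assumptions not confirmed by the question: the function name `tidy_weather` and the `SAMPLE` text are made up for illustration, every daily row is assumed to have exactly five whitespace-separated values, year lines are assumed to be four digits on their own, and month abbreviations are assumed to be English so that `%b` can parse them.

```python
import pandas as pd

# Hypothetical sample in the same layout as the question's data.
SAMPLE = """\
   2016
   Jan  high avg low sum
    1   27  21  14  0
    2   27  20  14  0
   2016
   Feb  high avg low sum
    1   28  23  18  0
"""


def tidy_weather(text):
    """Turn the year/month/day block layout into one tidy DataFrame."""
    # One stripped string per non-empty line.
    lines = pd.Series([ln.strip() for ln in text.splitlines() if ln.strip()])

    # Classify lines: a bare four-digit year, or a month header containing "high".
    is_year = lines.str.fullmatch(r"\d{4}")
    is_header = lines.str.contains("high")

    # Carry the most recent year and month label down onto the rows below them.
    year = lines.where(is_year).ffill()
    month = lines.where(is_header).str.split().str[0].ffill()

    # Everything that is neither a year nor a header is a daily row.
    keep = ~(is_year | is_header)
    values = lines[keep].str.split(expand=True)
    values.columns = ["day", "high", "avg", "low", "sum"]

    # Build a real date from year, month abbreviation, and day of month.
    date_text = year[keep] + " " + month[keep] + " " + values["day"]
    dates = pd.to_datetime(date_text, format="%Y %b %d")

    measures = ["high", "avg", "low", "sum"]
    tidy = values[measures].astype(float)
    tidy.insert(0, "date", dates)
    return tidy.reset_index(drop=True)


if __name__ == "__main__":
    print(tidy_weather(SAMPLE))
```

For the real file, the text could be read with `open("data_slice.csv").read()` and passed to the function. Having a single `date` column instead of separate year, month, and day strings makes later steps such as `set_index("date")` or resampling more direct.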