Question

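I have a CSV file with two columns. The first column holds timestamps at roughly 5-minute resolution, and the second column holds values, like this: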

 time,values
 2021-07-30 00:00:00,0.9667
 2021-07-30 00:03:54,0.5663
 ..
 ..
 ..

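Note that the second row is at 3 minutes 54 seconds. I am trying to resample the timestamp column to an exact 1-minute resolution and then fill in the values column as follows: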

 time,values
 2021-07-30 00:00:00,0.9667
 2021-07-30 00:01:00,0.9667
 2021-07-30 00:02:00,0.9667
 2021-07-30 00:03:00,0.9667
 2021-07-30 00:04:00,0.5663
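
The desired output above amounts to a forward fill onto an exact 1-minute grid: each grid timestamp takes the most recent reading at or before it. A minimal sketch of that idea, assuming pandas is available; the inline CSV text and the end of the grid are stand-ins chosen to match the sample, not part of the original data file:

```python
import io

import pandas as pd

# Stand-in for the CSV file; only the two rows shown in the question are used.
csv_text = """time,values
2021-07-30 00:00:00,0.9667
2021-07-30 00:03:54,0.5663
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"]).set_index("time")

# Exact 1-minute grid. The end point is an assumption chosen to match the sample output.
grid = pd.date_range("2021-07-30 00:00:00", "2021-07-30 00:04:00", freq="min", name="time")

# For each grid timestamp, take the last reading at or before it (forward fill).
resampled = df.reindex(grid, method="ffill").reset_index()
print(resampled)
```

Because the fill is driven by the full timestamps, the seconds never have to be extracted or rounded by hand.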

My approach

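I was able to create the 1-minute timestamp column. The next step is to assign the values according to the new timestamp resolution. My idea is to take the time difference between consecutive timestamp rows, store the result in a new column, and then append each value to the values column according to that difference. For example, if the time difference is 3, I take the first value in the values column and append it 3 times, and so on. Here is a partial result: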

,time,real-power,hours_min
 0,2021-07-30 00:00:00,0.9667,0
 1,2021-07-30 00:03:54,0.5663,00:03:54
 2,2021-07-30 00:08:51,0.9887,00:04:57
 3,2021-07-30 00:13:53,0.23334,00:05:02

I extracted the minutes from the hours_min column, with the following result:

,time,real-power,hours_min
 0,2021-07-30 00:00:00,0.9667,0.0
 1,2021-07-30 00:03:54,0.5663,3.0
 2,2021-07-30 00:08:51,0.9887,4.0
 3,2021-07-30 00:13:53,0.23334,5.0

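The 1-minute timestamp file has 1440 rows. When I appended the values to the values column, I got only 1319 values. I later found that the problem is that I ignored the seconds when extracting the minutes, which leaves the values column out of step with the timestamps.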
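
The shortfall can be reproduced on the four rows shown above. The sketch below contrasts the truncated minute differences with repeat counts taken from each timestamp rounded up to the next whole minute, which is the first grid row a reading applies to. The `end` value is an assumption made only to close the last interval; for a full day it would be 2021-07-31 00:00:00.

```python
import numpy as np
import pandas as pd

# The four rows shown in the question.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2021-07-30 00:00:00", "2021-07-30 00:03:54",
        "2021-07-30 00:08:51", "2021-07-30 00:13:53",
    ]),
    "real-power": [0.9667, 0.5663, 0.9887, 0.23334],
})

# Assumed (exclusive) end of the window, used to close the last interval.
end = pd.Timestamp("2021-07-30 00:15:00")

# Truncating each gap to whole minutes drops the seconds: 0 + 3 + 4 + 5 = 12.
truncated = (df["time"].diff().dt.total_seconds() // 60).fillna(0).astype(int)

# Round each reading up to the next whole minute: the first grid row it covers.
starts = df["time"].dt.ceil("min")
# Each reading covers the grid rows up to the next reading's first grid row.
next_starts = starts.shift(-1).fillna(end)
counts = ((next_starts - starts) // pd.Timedelta(minutes=1)).astype(int)

# Build the grid and repeat each value by its count; the lengths now agree.
grid = pd.date_range(starts.iloc[0], periods=int(counts.sum()), freq="min")
values = np.repeat(df["real-power"].to_numpy(), counts.to_numpy())
out = pd.DataFrame({"time": grid, "real-power": values})
print(truncated.sum(), counts.sum(), len(out))
```

Here the truncated differences add up to 12 while the grid has 15 rows, which is the same kind of gap as 1319 versus 1440.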

Here is my attempt:


# 2021/07/28


# The script converts a 5-minute timestamp csv file to a 1-minute timestamp csv file. The timestamp resolution in the input csv file
# is not consistent. The idea is to read the minutes from the first two timestamp rows, subtract them, and the result will be the number of rows
# that will be added between the first two timestamp rows. 

# Caveat:

# To read minutes from timestamp columns, we have to use the dt accessor. The dt accessor can be used only when the timestamp column is
# datetime-like (datetime or Timestamp objects). Therefore, we will 

#################################################################################
#################################################################################
#################################################################################

import pandas as pd
from datetime import datetime
from datetime import timedelta
import numpy as np

#################################################################################
##########################  Create Timestamp Column #############################
#################################################################################


# Function creates timestamp column. 

# Credits: Stackoverflow
def create_timestamp(length):                                   # When calling this function, pass the length of the timestamp in minutes. (1 day = 1440 minutes)
    time_str = '2021-07-30 00:00:00'                                # starting date.
    date_format_str = '%Y-%m-%d %H:%M:%S'                           # timestamp format
    given_time = datetime.strptime(time_str, date_format_str)

    timestamp = []
    for minute in range(length):                                # one entry per minute; length is 1440 for a full day
        final_time = given_time + timedelta(minutes=minute)
        final_time_str = final_time.strftime('%Y-%m-%d %H:%M:%S')
        timestamp.append(final_time_str)
    df = pd.DataFrame(timestamp)                                # create a dataframe for the new time stamp
    # df.to_csv("one_minute.csv")
    return df
#################################################################################
##########################          WRITE TO CSV            #####################
#################################################################################

def write_data(data,file_name):
    data.to_csv(file_name,index = False,header= False)

#################################################################################
##########################          Extract minutes         #####################
#################################################################################

def extract_minutes(df):

    df2 = pd.read_csv("data-2.csv",parse_dates=True)                # Type DataFrame

    df2['time'] = pd.to_datetime(df2['time'], errors = 'coerce')    # converted column type to datetime-like object
    
    # print(df2['time'].dtype)                                      # Double-check the type.

    df2['hours_min'] = df2['time'].diff()                           # Subtract previous row from current row and assign result to a new column.

    df2['hours_min'] = df2['hours_min'].astype(str).str.split('0 days ').str[-1] # remove 0 days from hour_min column


    df2['hours_min'] = pd.to_datetime(df2['hours_min'])



    df2['hours_min'] = df2['hours_min'].dt.minute.fillna(0)                     # Get only minutes and convert NaN values to zeroes.
    
    # df2['hours_min'] = pd.to_datetime(df2['hours_min'])

    # df2['hours_min'] = df2['hours_min'].dt.minute.fillna(0)

    df2.to_csv('check_minutes.csv')

    return df2
#################################################################################
#######     repeat power values to match new timestamp resolution       #########
#################################################################################

def read_power_vals(df,df2):
    new_pow_values = []

    # df2.set_index(['time'])['real-power'].repeat(df['hours_min'].astype(int)).reset_index()
    # df2['hours_min'].astype(int)
    # print(df2['real-power'].repeat(df2['hours_min']).reset_index())
    

    # print(df.iloc[:10])
    # print(df2)
    # dff = pd.DataFrame(np.repeat(df2['real-power'].values,3,axis=0))
    # print(dff)


    # print(df2)
    # p_vals = df2['real-power'].to_list()
    
    # minutes = df2['hours_min'].to_list()  

    # counter = 0

    # for i,k in zip(p_vals,minutes):
    #   # print(i,k)
    #   new_pow_values.append(i)
        
    #   print(f'this is counter {counter}.\n This is i {i}\n And this is k {k}\n This is the array {new_pow_values}\n')
        
    #   # counter = counter + 1

    #   if counter == k:
    #       i = i + 1
    #       # k = k + 1
    #       counter = 0
    #   else:
    #       new_pow_values.append(i)
    #       counter = counter + 1
    # if counter == 5:
    #   break
    # print(p_vals,len(p_vals))
    # print(minutes,len(minutes))
    

    # product = []
    # j = 0
    # for i in minutes:
    #   product.extend(int(i) * [p_vals[j]])
    #   # print(int(i) * [p_vals[j]], "should be of length: ",int(i))

    #   j+=1

        # print(product,len(product))
    # print(minutes[-10:])
    # print(len(product))
    # print(len(df))
    # print(len(new_pow_values))





x = create_timestamp(1440)
# write_data(x,'trial.csv')

y = extract_minutes(x)

z = read_power_vals(x,y)
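
As a side note on the script above, the loop in create_timestamp could probably be replaced by a single `pd.date_range` call. The sketch below is one possible shape; `create_timestamp_index` and its default arguments are hypothetical names introduced here for illustration, not part of the original script:

```python
import pandas as pd


def create_timestamp_index(start="2021-07-30 00:00:00", length=1440):
    """Hypothetical stand-in for create_timestamp: `length` consecutive minutes from `start`."""
    return pd.date_range(start=start, periods=length, freq="min")


grid = create_timestamp_index()

# Same string format as the original function, in case strings are needed for the CSV.
as_strings = grid.strftime("%Y-%m-%d %H:%M:%S")
print(as_strings[0], as_strings[-1], len(grid))
```

Keeping the grid as a DatetimeIndex rather than a list of strings means it can be compared directly against the parsed `time` column.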

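I guess what I am asking is: is there an efficient way to do this? I don't think my approach will get me what I want. Are there other options?

Thanks, everyone.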

Timestamp resampling and appending values

0 answers: