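I have a csv file with two columns. The first column holds timestamps at roughly 5-minute resolution, and the second column holds values, like this: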
time,values
2021-07-30 00:00:00,0.9667
2021-07-30 00:03:54,0.5663
..
..
..
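Note that the second row here is at 3 minutes 54 seconds. I am trying to resample the timestamp column to exactly 1-minute resolution and then fill in the values column as follows: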
time,values
2021-07-30 00:00:00,0.9667
2021-07-30 00:01:00,0.9667
2021-07-30 00:02:00,0.9667
2021-07-30 00:03:00,0.9667
2021-07-30 00:04:00,0.5663
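To make the target concrete: each 1-minute timestamp should carry the last original value recorded at or before it. A minimal sketch of that rule, assuming pandas and using only the two sample rows above (the in-memory `raw` buffer stands in for the real csv file, and the names `grid` and `one_minute` are made up for illustration):

```python
import io
import pandas as pd

# Sample rows copied from the question; a real run would read the csv file instead.
raw = io.StringIO(
    "time,values\n"
    "2021-07-30 00:00:00,0.9667\n"
    "2021-07-30 00:03:54,0.5663\n"
)
df = pd.read_csv(raw, parse_dates=["time"]).set_index("time")

# Exact 1-minute grid covering the data; the end is rounded up to the next full minute.
grid = pd.date_range(
    df.index[0].floor("min"), df.index[-1].ceil("min"), freq="min", name="time"
)

# Each grid point takes the last original value at or before it (forward fill).
one_minute = df.reindex(grid, method="ffill").reset_index()
print(one_minute)
```

For a full day, the grid could instead be built with `pd.date_range("2021-07-30", periods=1440, freq="min")`, assuming the data starts at midnight.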
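My approach
I was able to create the one-minute timestamp column. The next step is to assign the values in the values column according to the new timestamp resolution. My idea is to take the time difference between consecutive timestamp rows, store the result in a new column, and then append each value to the values column according to that difference. For example, if the time difference is 3, I take the first value in the values column and append it 3 times, and so on. Here is a partial result: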
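The repeat step described above can be sketched with `np.repeat`. This is only an illustration of the idea, assuming the whole-minute gaps are already available in a `hours_min` column and that the last row, which has no following gap, is kept once (that choice is an assumption, not something the data dictates):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "real-power": [0.9667, 0.5663, 0.9887, 0.23334],
    # Whole minutes between this row and the previous one.
    "hours_min": [0, 3, 4, 5],
})

# hours_min on row i describes the gap *before* row i, so shift it up one row
# to get how many times the previous value should be repeated.
# Assumption: the last row has no following gap and is kept once.
counts = df2["hours_min"].shift(-1).fillna(1).astype(int)

repeated = np.repeat(df2["real-power"].to_numpy(), counts.to_numpy())
print(len(repeated))
```

The result is only as good as the counts: if the gaps are truncated to whole minutes, the repeated values will not add up to the length of the 1-minute grid.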
,time,real-power,hours_min
0,2021-07-30 00:00:00,0.9667,0
1,2021-07-30 00:03:54,0.5663,00:03:54
2,2021-07-30 00:08:51,0.9887,00:04:57
3,2021-07-30 00:13:53,0.23334,00:05:02
I extracted the minutes from the hours_min column, which gives:
,time,real-power,hours_min
0,2021-07-30 00:00:00,0.9667,0.0
1,2021-07-30 00:03:54,0.5663,3.0
2,2021-07-30 00:08:51,0.9887,4.0
3,2021-07-30 00:13:53,0.23334,5.0
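The one-minute timestamp file has 1440 rows. When I appended the values to the values column, I got only 1319 values. I later realized the problem: I did not account for the seconds when extracting the minutes, so the values column ends up inconsistent with the timestamps.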
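The shortfall can be reproduced on the four sample rows above. The sketch below, assuming pandas, compares the truncated gaps with counts taken against the 1-minute grid (rounding each timestamp up to the next full minute before differencing); the variable names are made up for illustration:

```python
import pandas as pd

times = pd.to_datetime(pd.Series([
    "2021-07-30 00:00:00",
    "2021-07-30 00:03:54",
    "2021-07-30 00:08:51",
    "2021-07-30 00:13:53",
]))

# Truncating each gap to whole minutes drops the seconds of every gap.
truncated = (times.diff().dt.total_seconds() // 60).fillna(0).astype(int)
print(truncated.tolist())    # [0, 3, 4, 5]
print(truncated.sum())       # 12

# Counting against the 1-minute grid instead: round each timestamp up to the
# next full minute, then difference, so the counts add up to the real span.
on_grid = times.dt.ceil("min")
grid_counts = (on_grid.diff().dt.total_seconds() // 60).fillna(0).astype(int)
print(grid_counts.tolist())  # [0, 4, 5, 5]
print(grid_counts.sum())     # 14
```

The grid from 00:00 to 00:13 has 14 points, so the truncated gaps already lose two rows within the first fourteen minutes, and the loss keeps accumulating over the day.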
Here is my attempt:
# 2021/07/28
# The script converts a 5-minute timestamp csv file to a 1-minute timestamp csv file. The timestamp resolution in the input csv file
# is not consistent. The idea is to read the minutes from the first two timestamp rows, subtract them, and the result will be the number of rows
# that will be added between the first two timestamp rows.
# Caveat:
# To read minutes from timestamp columns, we have to use the dt accessor. The dt accessor can be used only when the timestamp column is
# a datetime-like or Timestamp-like object. Therefore, we will
#################################################################################
#################################################################################
#################################################################################
import pandas as pd
from datetime import datetime
from datetime import timedelta
import numpy as np
#################################################################################
########################## Create Timestamp Column #############################
#################################################################################
# Function creates timestamp column.
# Credits: Stackoverflow
def create_timestamp(length): # When calling this function, pass the length of the timestamp in minutes. (1 day = 1440 minutes)
    time_str = '2021-07-30 00:00:00' # starting date.
    date_format_str = '%Y-%m-%d %H:%M:%S' # timestamp format
    given_time = datetime.strptime(time_str, date_format_str)
    timestamp = []
    for minute in range(length): # one entry per minute; length=1440 covers a full day
        final_time = given_time + timedelta(minutes=minute)
        final_time_str = final_time.strftime('%Y-%m-%d %H:%M:%S')
        timestamp.append(final_time_str)
    df = pd.DataFrame(timestamp) # create a dataframe for the new time stamp
    # df.to_csv("one_minute.csv")
    return df
#################################################################################
########################## WRITE TO CSV #####################
#################################################################################
def write_data(data, file_name):
    data.to_csv(file_name, index=False, header=False)
#################################################################################
########################## Extract minutes #####################
#################################################################################
def extract_minutes(df):
    df2 = pd.read_csv("data-2.csv",parse_dates=True) # Type DataFrame
    df2['time'] = pd.to_datetime(df2['time'], errors = 'coerce') # converted column type to datetime-like object
    # print(df2['time'].dtype) # Double-check the type.
    df2['hours_min'] = df2['time'].diff() # Subtract the previous row from the current row and assign the result to a new column.
    df2['hours_min'] = df2['hours_min'].astype(str).str.split('0 days ').str[-1] # remove 0 days from hours_min column
    df2['hours_min'] = pd.to_datetime(df2['hours_min'])
    df2['hours_min'] = df2['hours_min'].dt.minute.fillna(0) # Get only minutes (seconds are dropped here) and convert NaN values to zeroes.
    # df2['hours_min'] = pd.to_datetime(df2['hours_min'])
    # df2['hours_min'] = df2['hours_min'].dt.minute.fillna(0)
    df2.to_csv('check_minutes.csv')
    return df2
#################################################################################
####### repeat power values to match new timestamp resolution #########
#################################################################################
def read_power_vals(df,df2):
    new_pow_values = []
    # df2.set_index(['time'])['real-power'].repeat(df['hours_min'].astype(int)).reset_index()
    # df2['hours_min'].astype(int)
    # print(df2['real-power'].repeat(df2['hours_min']).reset_index())
    # print(df.iloc[:10])
    # print(df2)
    # dff = pd.DataFrame(np.repeat(df2['real-power'].values,3,axis=0))
    # print(dff)
    # print(df2)
    # p_vals = df2['real-power'].to_list()
    # minutes = df2['hours_min'].to_list()
    # counter = 0
    # for i,k in zip(p_vals,minutes):
    #     # print(i,k)
    #     new_pow_values.append(i)
    #     print(f'this is counter {counter}.\n This is i {i}\n And this is k {k}\n This is the array {new_pow_values}\n')
    #     # counter = counter + 1
    #     if counter == k:
    #         i = i + 1
    #         # k = k + 1
    #         counter = 0
    #     else:
    #         new_pow_values.append(i)
    #         counter = counter + 1
    #     if counter == 5:
    #         break
    # print(p_vals,len(p_vals))
    # print(minutes,len(minutes))
    # product = []
    # j = 0
    # for i in minutes:
    #     product.extend(int(i) * [p_vals[j]])
    #     # print(int(i) * [p_vals[j]], "should be of length: ",int(i))
    #     j+=1
    # print(product,len(product))
    # print(minutes[-10:])
    # print(len(product))
    # print(len(df))
    # print(len(new_pow_values))
x = create_timestamp(1440)
# write_data(x,'trial.csv')
y = extract_minutes(x)
z = read_power_vals(x,y)
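I guess what I'm asking is whether there is an efficient way to do this. I don't think my approach will get me what I want. Is there an alternative?
Thanks, everyone.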