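I have the following dataset, with a value read every 5 seconds. I need to perform two operations on the dataset.
What is the best way to achieve this?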
2018-02-10 17:25:49.074206,340
2018-02-10 17:25:54.078155,340
2018-02-10 17:25:59.081041,340
2018-02-10 17:26:04.085504,340
2018-02-10 17:26:09.089500,340
2018-02-10 17:26:14.092926,340
2018-02-10 17:26:19.097002,340
2018-02-10 17:26:24.101067,340
2018-02-10 17:26:29.104451,340
2018-02-10 17:26:34.108283,340
2018-02-10 17:26:39.112641,340
2018-02-10 17:26:44.115325,340
2018-02-10 17:26:49.120067,340
2018-02-10 17:26:54.124166,340
2018-02-10 17:26:59.127224,340
I have gone through various Stack Overflow posts and ended up with the code below. It is not optimal, and norm_by_data1 still raises an error.
import pandas as pd
from pandas import read_csv
from pandas import datetime
from matplotlib import pyplot
def parser(x):
    return datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f').strftime('%Y-%m-%d %H:%M')
def parser1(x):
    return datetime.strptime(x, '%Y-%m-%d %H:%M:%S').strftime('%Y-%m-%d %H')
def norm_by_data(x):
    return x.mean()
prevrow = None
total = None
def norm_by_data1(x):
    for row in x:
        total += row - prevrow
        prevrow = row
series = read_csv('water_data.txt', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)
#print(series.head())
series.groupby(level=0).apply(norm_by_data).to_csv("tmp")
series1 = read_csv('tmp', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser1)
series1.groupby(level=0).apply(norm_by_data1)
Answer 0 (score: 0)
0) To get a datetime index from the .csv file, you can do the following:
df = pd.read_csv('water_data.txt', parse_dates=[0], index_col=0)
parse_dates=[0] parses the dates in the column at position 0, and index_col=0 sets column 0 as the DataFrame index.
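As a minimal sketch of the call above, the snippet below reads a few rows copied from the question out of an in-memory string instead of water_data.txt. The sample shown in the question has no header row, so header=None and the column names "timestamp" and "values" are assumptions made for illustration.

```python
import io

import pandas as pd

# Three rows copied from the question. The sample has no header row,
# so the column names passed to read_csv are assumed, not taken from the file.
raw = """2018-02-10 17:25:49.074206,340
2018-02-10 17:25:54.078155,340
2018-02-10 17:25:59.081041,340
"""

df = pd.read_csv(
    io.StringIO(raw),
    header=None,                    # assumption: the file has no header line
    names=["timestamp", "values"],  # hypothetical column names
    parse_dates=[0],                # parse column 0 as datetimes
    index_col=0,                    # use column 0 as the index
)

print(type(df.index).__name__)  # DatetimeIndex
```

If the real file does start with a header line, dropping header=None and names should give the same kind of index.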
1) You need a datetime index set, then run the following code (if you don't have a datetime index, let me know and I'll show you how to set one):
df.resample('1Min').mean()
2) You also need a datetime index for this.
# Gets mean for every minute
ndf = df.resample('1Min').mean()
# Calculate difference from mean in actual minute from previous minute
ndf['diff'] = ndf['values'].diff(periods=1) # You might need to chain here a .abs() as well
# Produces sum of differences for a given hour
ndf['diff'].resample('1H').sum()
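To make the three steps concrete, here is a small self-contained sketch on synthetic data. The readings (340, 342, 345) are made up so that the differences are non-zero, and the column name "values" follows the answer's code. The lowercase aliases "min" and "h" are used because newer pandas versions prefer them over "1Min" and "1H".

```python
import pandas as pd

# Synthetic readings every 5 seconds for three minutes (hypothetical values).
index = pd.date_range("2018-02-10 17:25:00", periods=36, freq="5s")
values = [340] * 12 + [342] * 12 + [345] * 12
df = pd.DataFrame({"values": values}, index=index)

# Mean for every minute.
ndf = df.resample("min").mean()

# Difference between each minute's mean and the previous minute's mean.
# The first row has no predecessor, so its diff is NaN.
ndf["diff"] = ndf["values"].diff()

# Sum of the per-minute differences within each hour (NaN is skipped).
hourly = ndf["diff"].resample("h").sum()

print(ndf)
print(hourly)
```

With these values the per-minute means are 340, 342 and 345, so the differences are NaN, 2 and 3, and the single hourly bucket sums to 5.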
3) This should work for aggregating negative and positive values with different functions:
# It will throw an error if 'func()' is not defined
ndf['diff'].resample('1H').agg({'neg': [lambda x: x[x < 0].func()], 'pos': [lambda x: x[x > 0].sum()]})
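A dict of lists passed to .agg on a Series may not be accepted by newer pandas versions, so the sketch below takes a different route that avoids .agg entirely: it clips the differences at zero in each direction and sums each part per hour. This is a swapped-in technique, not the answer's own method, and the diff values are hypothetical.

```python
import pandas as pd

# Hypothetical per-step differences spread over two hours.
index = pd.date_range("2018-02-10 17:00", periods=4, freq="30min")
diff = pd.Series([2.0, -1.0, -3.0, 4.0], index=index, name="diff")

# clip(upper=0) keeps the negative values and turns positives into 0;
# clip(lower=0) does the opposite. Summing per hour then gives each total.
neg = diff.clip(upper=0).resample("h").sum()
pos = diff.clip(lower=0).resample("h").sum()

result = pd.DataFrame({"neg": neg, "pos": pos})
print(result)
```

Here the 17:00 bucket gets neg = -1 and pos = 2, and the 18:00 bucket gets neg = -3 and pos = 4. Any other reduction (for example a mean of only the negative values) would need a mask such as diff[diff < 0] before resampling instead of clipping.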