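Given a CSV dataset with dates and values, I want to create a new CSV dataset whose output contains the points where the graph changed direction: increase, decrease, or no change at all. Below is a sample of the data and the desired output. (The CSV goes back to 1999.)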
Date Value
07/04/2014 137209.0
04/04/2014 137639.0
03/04/2014 137876.0
02/04/2014 137795.0
01/04/2014 137623.0
31/03/2014 137589.0
28/03/2014 137826.0
27/03/2014 138114.0
26/03/2014 138129.0
25/03/2014 137945.0
The output should be:
StartDate EndDate StartValue EndValue
03/04/2014 07/04/2014 137876 137209
31/03/2014 03/04/2014 137589 137876
27/03/2014 31/03/2014 138114 137589
26/03/2014 27/03/2014 138129 138114
25/03/2014 26/03/2014 137945 138129
Answer 0 (score: 3)
My attempt at solving this uses a self-written Stretch class that manages the splitting of the data as it is added:
from enum import Enum

class Direction(Enum):
    NA = None
    Up = 1
    Stagnant = 0
    Down = -1

    @staticmethod
    def getDir(a,b):
        """Gets two numbers and returns a Direction result by comparing them."""
        if a < b: return Direction.Up
        elif a > b: return Direction.Down
        else: return Direction.Stagnant

class Stretch:
    """Accepts tuples of (insignificant, float). Adds tuples to internal data struct
    while they have the same trend (down, up, stagnant). See add() for details."""
    def __init__(self,dp=None):
        self.data = []
        if dp:
            self.data.append(dp)
        self.dir = Direction.NA

    def add(self,dp):
        """Adds dp to self if it follows a given trend (or it holds less than 2 datapts).
        Returns (True,None) if the datapoint was added to this Stretch instance,
        returns (False, new_stretch) if it broke the trend. The new_stretch
        contains the new last value of the self.data as well as the new dp."""
        if not self.data:
            self.data.append(dp)
            return True, None
        if len(self.data) == 1:
            # the second datapoint defines the trend of this stretch
            self.dir = Direction.getDir(self.data[-1][1],dp[1])
            self.data.append(dp)
            return True, None
        if Direction.getDir(self.data[-1][1],dp[1]) == self.dir:
            self.data.append(dp)
            return True, None
        else:
            # trend broken: the turning point also starts the next stretch
            k = Stretch(self.data[-1])
            k.add(dp)
            return False, k
Demo file:
with open("d.txt","w") as w:
    w.write( """Date Value
07/04/2014 137209.0
04/04/2014 137639.0
03/04/2014 137876.0
02/04/2014 137795.0
01/04/2014 137623.0
31/03/2014 137589.0
28/03/2014 137826.0
27/03/2014 138114.0
26/03/2014 138129.0
25/03/2014 137945.0
""" )
Usage:
data_stretches = []

with open("d.txt") as r:
    S = Stretch()
    for line in r:
        try:
            date,value = line.strip().split()
            value = float(value)
        except (IndexError, ValueError) as e:
            # the header line (and any malformed line) ends up here
            print("Illegal line: '{}'".format(line))
            continue

        b, newstretch = S.add( (date,value) )
        if not b:
            data_stretches.append(S)
            S = newstretch

# do not forget the stretch that was still open at the end of the file
data_stretches.append(S)

for s in data_stretches:
    data = s.data
    direc = s.dir
    print(data[0][0], data[-1][0], data[0][1], data[-1][1], direc)
Output:
# EndDate StartDate EndV StartV (reversed b/c I inverted dates)
07/04/2014 03/04/2014 137209.0 137876.0 Direction.Up
03/04/2014 31/03/2014 137876.0 137589.0 Direction.Down
31/03/2014 26/03/2014 137589.0 138129.0 Direction.Up
26/03/2014 25/03/2014 138129.0 137945.0 Direction.Down
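Apart from the confusion about direction, which depends on whether you evaluate "from when to when", my output also differs from yours, because you split one uniform sequence into two parts for no apparent reason:
27/03/2014 31/03/2014 138114 137589 # further down
26/03/2014 27/03/2014 138129 138114 # down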
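For comparison, the same splitting can be sketched without a class, emitting rows directly in the StartDate/EndDate order the question asks for. This is only a minimal illustration: the function name turning_segments and the hard-coded rows list are made up for the example, and the input is assumed to be ordered newest first, as in the sample file.

```python
def turning_segments(rows):
    """Split (date, value) rows, given newest first, into runs with one trend.

    Returns (start_date, end_date, start_value, end_value) tuples, where
    "start" is the chronologically earlier end of each run.
    """
    def sign(x):
        return (x > 0) - (x < 0)

    segments = []
    newest = 0        # index of the newest point in the current run
    prev_dir = None   # trend of the current run, unknown until two points are seen
    for i in range(1, len(rows)):
        direction = sign(rows[i][1] - rows[i - 1][1])
        if prev_dir is not None and direction != prev_dir:
            oldest = i - 1  # the turning point closes this run and opens the next
            segments.append((rows[oldest][0], rows[newest][0],
                             rows[oldest][1], rows[newest][1]))
            newest = oldest
        prev_dir = direction
    if len(rows) > 1:
        # close the run that was still open at the end of the data
        segments.append((rows[-1][0], rows[newest][0],
                         rows[-1][1], rows[newest][1]))
    return segments


# sample data from the question, newest first
rows = [
    ("07/04/2014", 137209.0), ("04/04/2014", 137639.0),
    ("03/04/2014", 137876.0), ("02/04/2014", 137795.0),
    ("01/04/2014", 137623.0), ("31/03/2014", 137589.0),
    ("28/03/2014", 137826.0), ("27/03/2014", 138114.0),
    ("26/03/2014", 138129.0), ("25/03/2014", 137945.0),
]

result = turning_segments(rows)
for seg in result:
    print(*seg)
```

On the sample data this yields the same four stretches as the Stretch class, with each row read from the older date to the newer one.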
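Answer 1 (score: 2)
You can use sign from numpy and apply it to the diff of the "Value" column to see where the trend of the graph changes, then use shift and cumsum to create an incremental value for each group of rows that share a trend: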
import numpy as np
# df is the DataFrame read from the CSV, with columns Date and Value
ser_sign = np.sign(df.Value.diff(-1).ffill())
ser_gr = (ser_sign.shift() != ser_sign).cumsum()
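To make the intermediate series concrete, here is a small worked example on the ten sample rows from the question. The DataFrame is built inline only so that the snippet is self-contained; in practice df would come from reading the CSV.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": ["07/04/2014", "04/04/2014", "03/04/2014", "02/04/2014",
             "01/04/2014", "31/03/2014", "28/03/2014", "27/03/2014",
             "26/03/2014", "25/03/2014"],
    "Value": [137209.0, 137639.0, 137876.0, 137795.0, 137623.0,
              137589.0, 137826.0, 138114.0, 138129.0, 137945.0],
})

# diff(-1) subtracts the next row from the current one; the last row has no
# next row, so it is NaN and ffill() copies the previous difference into it.
ser_sign = np.sign(df.Value.diff(-1).ffill())
# sign per row:  -1 -1  1  1  1 -1 -1 -1  1  1

# A new group starts wherever the sign differs from the row above;
# cumsum() turns those change flags into increasing group numbers.
ser_gr = (ser_sign.shift() != ser_sign).cumsum()
# group per row:  1  1  2  2  2  3  3  3  4  4

print(pd.DataFrame({"Value": df.Value, "sign": ser_sign, "group": ser_gr}))
```

Each group begins at a row where the trend changes, which is why the first row of one group is also the end point of the stretch before it.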
Now that you know the groups, to get the start and end of each one you can use groupby on ser_gr and join the last (after applying shift to ser_gr, because the last value of each group is the first value of the next one) with the first.
df_new = (df.groupby(ser_gr.shift().bfill(),as_index=False).last()
            .join(df.groupby(ser_gr,as_index=False).first(),lsuffix='_start',rsuffix='_end'))
print (df_new)
   Date_start  Value_start    Date_end  Value_end
0  03/04/2014     137876.0  07/04/2014   137209.0
1  31/03/2014     137589.0  03/04/2014   137876.0
2  26/03/2014     138129.0  31/03/2014   137589.0
3  25/03/2014     137945.0  26/03/2014   138129.0
Now, if you need to reorder the columns and rename them, you can do it with:
df_new.columns = ['StartDate', 'StartValue', 'EndDate', 'EndValue']
df_new = df_new[['StartDate','EndDate','StartValue','EndValue']]
print (df_new)
    StartDate     EndDate  StartValue  EndValue
0  03/04/2014  07/04/2014    137876.0  137209.0
1  31/03/2014  03/04/2014    137589.0  137876.0
2  26/03/2014  31/03/2014    138129.0  137589.0
3  25/03/2014  26/03/2014    137945.0  138129.0
These two operations can also be done at the same time as df_new is created.
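One possible way to fold the creation, renaming and reordering into a single step is sketched below. This is an assumption about how it could be done, not necessarily what the answer's author had in mind: the intermediate names starts and ends are made up for the example, and the group keys are passed as plain arrays with the default as_index so that the result does not depend on how as_index=False treats external keys.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": ["07/04/2014", "04/04/2014", "03/04/2014", "02/04/2014",
             "01/04/2014", "31/03/2014", "28/03/2014", "27/03/2014",
             "26/03/2014", "25/03/2014"],
    "Value": [137209.0, 137639.0, 137876.0, 137795.0, 137623.0,
              137589.0, 137826.0, 138114.0, 138129.0, 137945.0],
})

ser_sign = np.sign(df.Value.diff(-1).ffill())
ser_gr = (ser_sign.shift() != ser_sign).cumsum()

# Older end of each stretch: last row of the groups shifted down by one row.
starts = df.groupby(ser_gr.shift().bfill().to_numpy()).last().reset_index(drop=True)
# Newer end of each stretch: first row of the unshifted groups.
ends = df.groupby(ser_gr.to_numpy()).first().reset_index(drop=True)

# Build the frame with its final column names and order in one go.
df_new = pd.DataFrame({
    "StartDate": starts["Date"],
    "EndDate": ends["Date"],
    "StartValue": starts["Value"],
    "EndValue": ends["Value"],
})
print(df_new)
```

Because both helper frames get a fresh 0..n-1 index, their columns line up row by row without needing join suffixes.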