Question

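I want to write a function that reads a series of time values from a file (the sampling is irregular, with gaps, which is the problem), takes exactly 200 days at a time, and lets me step through the whole length of the data, say 10000 days, as a kind of rolling window.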

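I'm not sure how to code this. Could I add a statement that accumulates the differences between successive values of the time variable (x-axis) until the total reaches exactly 200 days?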

Or could I somehow write a function that finds a starting value, say t0, and then finds the array element closest to t0 + (interval =) 200 days?

So far I have:

  f = open(reading the file from directory)

  lines = f.readlines()
  print(len(lines))



  tx = np.array([]) # times 
  y= np.array([])
  interval = 200 # days 



  for li in lines:
     col = li.split()

     t0 = np.array([])
     t1 = np.array([])


     tx = np.append(tx, float(col[0]))
     y= np.append(y, float(col[1]))

  t0 = np.append(t0, np.max(tx))
  t1 = np.append(t1, tx[np.argmin(tx)])

  print(t0,t1)

  days = [t1 + dt.timedelta(days = float(x)) for x in days]
  #y = np.random.randn(len(days))

  # use pandas for convenient rolling function:
  df = pd.DataFrame({"day":tx, "value": y}).set_index("day")

  def closest_value(s):
      if s.shape[0]<2:
          return np.nan
      X = np.empty((s.shape[0]-1, 2))
      X[:, 0] = s[:-1]
      X[:, 1] = np.fabs(s[:-1]-s[-1])
      min_diff = np.min(X[:, 1])
      return X[X[:, 1]==min_diff, 0][0]

  df['closest_value'] = df.rolling(window=dt.timedelta(days=200))['value'].apply(closest_value, raw=True)
  print(df.tail(5))

Output error:

  TypeError: float() argument must be a string or a number, not 'datetime.datetime'

Also, the first 10 tx and y values are:

 0                  0.003372722575018
 0.015239999629557  0.003366515509113
 0.045829999726266  0.003385171061055
 0.075369999743998  0.003385171061055
 0.993219999596477  0.003366515509113
 1.022699999623     0.003378941085299
 1.05217999964952   0.003369617612836
 1.08166999975219   0.003397665493594
 3.0025899996981    0.003378941085299
 3.04120999993756   0.003394537568711

Answer 1

import numpy as np
import pandas as pd
import datetime as dt

# load data in days and y arrays

# ... or generate them:
N = 1000 # number of random samples drawn (duplicate days are dropped below)
day_min = dt.datetime.strptime('2000-01-01', '%Y-%m-%d')
day_max = 2000

days = np.sort(np.unique(np.random.uniform(low=0, high=day_max, size=N).astype(int)))
days = [day_min + dt.timedelta(days = int(x)) for x in days]
y = np.random.randn(len(days))

# use pandas for convenient rolling function:
df = pd.DataFrame({"day":days, "value": y}).set_index("day")

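def closest_value(s):
    # s holds the raw values inside the current 200-day window; s[-1] is the newest one
    if s.shape[0]<2:
        return np.nan  # nothing earlier in the window to compare against
    X = np.empty((s.shape[0]-1, 2))
    X[:, 0] = s[:-1]                   # earlier values in the window
    X[:, 1] = np.fabs(s[:-1]-s[-1])    # their absolute distance to the newest value
    min_diff = np.min(X[:, 1])
    return X[X[:, 1]==min_diff, 0][0]  # earlier value closest to the newest one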

df['closest_value'] = df.rolling(window=dt.timedelta(days=200))['value'].apply(closest_value, raw=True)
print(df.tail(5))

Output:

               value  closest_value
day                                
2005-06-15  1.668638       1.591505
2005-06-16  0.316645       0.304382
2005-06-17  0.458580       0.445592
2005-06-18 -0.846174      -0.847854
2005-06-22 -0.151687      -0.166404
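The data in the question stores time as float days rather than datetimes, which is likely what triggers the TypeError there. One way around it, as a minimal sketch: attach the float days to an arbitrary reference date so that pandas' time-based rolling can be used. The helper name `rolling_mean_by_days` and the reference date are assumptions made for illustration, and the rolling mean is only a stand-in for whatever statistic you need per window.

```python
import numpy as np
import pandas as pd


def rolling_mean_by_days(tx, y, interval_days=200):
    """Rolling mean of y over the trailing `interval_days` days.

    tx is assumed to hold times as float days (as in the question's file).
    This is a hypothetical helper, not part of pandas.
    """
    epoch = pd.Timestamp("2000-01-01")  # arbitrary reference date (assumption)
    # Float days -> timedeltas -> absolute timestamps usable as a rolling index
    index = epoch + pd.to_timedelta(np.asarray(tx, dtype=float), unit="D")
    series = pd.Series(np.asarray(y, dtype=float), index=index).sort_index()
    # A time-based window such as "200D" covers (t - 200 days, t] for each sample t
    return series.rolling(f"{interval_days}D").mean()
```

Because the window is defined in time and not in rows, gaps in the sampling only change how many points fall into each window.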

Answer 2

You can use pandas, convert the date column to datetimes, and use a while loop to process the data in batches.

import pandas as pd
from datetime import datetime, timedelta

# Load data into pandas dataframe
df = pd.read_csv(filepath)

# Name columns
df.columns = ['dates', 'num_value']

# Convert strings to datetime
df.dates = pd.to_datetime(df['dates'], format='%d/%m/%Y')

# Print dates within a 200 day interval and move on to the next interval.
# Advance by time rather than by 200 rows, so gaps in the sampling are handled.
start = df.dates.min()
last = df.dates.max()
while start <= last:
    end = start + timedelta(days=200)
    print(df.dates[(df.dates >= start) & (df.dates < end)])
    start = end

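For a whitespace-separated file with a header row like the one below, skip the header with skiprows=1; if the file has no header row, omit skiprows: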

dates     num_value
2004-7-1  1
2004-7-2  5
2004-7-4  8
2004-7-5  11
2004-7-6  17

# header=None keeps the first data row from being consumed as column names
df = pd.read_table(filepath, sep=r"\s+", skiprows=1, header=None)
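The question's own idea (pick a start t0, then locate the element at or nearest to t0 + 200 days) can also be sketched directly on the float-day array with NumPy. This is only a sketch under stated assumptions: `tx` is sorted ascending, and `iter_windows` and `nearest_index` are hypothetical helper names introduced here, not library functions.

```python
import numpy as np


def nearest_index(tx, target):
    """Index of the element of sorted tx that is closest to target."""
    tx = np.asarray(tx, dtype=float)
    i = int(np.searchsorted(tx, target))  # first index with tx[i] >= target
    if i == 0:
        return 0
    if i == len(tx):
        return len(tx) - 1
    # Pick whichever neighbour is closer to the target
    return i if tx[i] - target < target - tx[i - 1] else i - 1


def iter_windows(tx, interval=200.0):
    """Yield (t_start, lo, hi) such that tx[lo:hi] lies in [t_start, t_start + interval)."""
    tx = np.asarray(tx, dtype=float)
    if tx.size == 0:
        return
    n_windows = int(np.floor((tx[-1] - tx[0]) / interval)) + 1
    for k in range(n_windows):
        t_start = tx[0] + k * interval
        # Binary search for both window edges; gaps just give shorter slices
        lo, hi = np.searchsorted(tx, [t_start, t_start + interval], side="left")
        yield t_start, int(lo), int(hi)
```

Each yielded pair can then slice both arrays, for example `y[lo:hi]`, so every batch covers exactly 200 days of time regardless of how many samples it contains.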

Processing real-time data in Python with a rolling window
